Explore the CAZome of Pectobacteriaceae genomes¶

This notebook explores the size and composition of 660 Genbank Pectobacteriaceae CAZomes.

Table of Contents¶

  1. Imports
    • Load packages
    • Load in data
  2. CAZome size
    • Compare the number of CAZymes
    • Compare the proportion of the proteome represented by the CAZomes
  3. CAZy classes
    • The number of CAZymes per CAZy class
    • Mean (+/- SD) number of CAZymes per CAZy class per genus
  4. CAZy families
    • Calculate CAZy family frequencies per genome
    • Plot a clustermap of CAZy family frequencies
  5. Core CAZome
    • Identify families that are present in all genomes
    • Calculate the frequency of families in the core CAZome
    • Pectobacterium core CAZome
    • Dickeya
  6. Always co-occurring families
    • Identify CAZy families that are always present in the genome together
    • Explore across all Pectobacterium and Dickeya genomes, and per genus
    • Build an upset plot of co-occurring CAZy families
    • Compile a matrix with the indcidence data for each group of co-occurring CAZy families
  7. Principal Component Analysis (PCA)
    • Explore the association between the host range, global distribution and composition of the CAZome
    • Explore across all Pectobacterium and Dickeya genomes, and per genus

0. Imports¶

Packages¶

In [1]:
!pip3 install -e /home/emmah/eastbio_storage/cazomevolve_dev/
Obtaining file:///home/emmah/eastbio_storage/cazomevolve_dev
  Preparing metadata (setup.py) ... done
Requirement already satisfied: adjustText in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from cazomevolve==0.0.4) (0.8)
Requirement already satisfied: cazy_webscraper in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from cazomevolve==0.0.4) (2.2.7)
Requirement already satisfied: biopython in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from cazomevolve==0.0.4) (1.81)
Requirement already satisfied: pandas in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from cazomevolve==0.0.4) (2.0.0)
Requirement already satisfied: tqdm in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from cazomevolve==0.0.4) (4.65.0)
Requirement already satisfied: sklearn in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from cazomevolve==0.0.4) (0.0.post1)
Requirement already satisfied: scikit-learn in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from cazomevolve==0.0.4) (1.2.2)
Requirement already satisfied: saintbioutils in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from cazomevolve==0.0.4) (0.0.24)
Requirement already satisfied: numpy in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from cazomevolve==0.0.4) (1.24.2)
Requirement already satisfied: seaborn in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from cazomevolve==0.0.4) (0.12.2)
Requirement already satisfied: sqlalchemy in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from cazomevolve==0.0.4) (1.4.20)
Requirement already satisfied: upsetplot in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from cazomevolve==0.0.4) (0.8.0)
Requirement already satisfied: scipy in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from cazomevolve==0.0.4) (1.10.1)
Requirement already satisfied: jupyter in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from cazomevolve==0.0.4) (1.0.0)
Requirement already satisfied: matplotlib in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from adjustText->cazomevolve==0.0.4) (3.7.1)
Requirement already satisfied: bioservices>=1.10.4 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from cazy_webscraper->cazomevolve==0.0.4) (1.11.2)
Requirement already satisfied: pyyaml in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from cazy_webscraper->cazomevolve==0.0.4) (6.0)
Requirement already satisfied: requests in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from cazy_webscraper->cazomevolve==0.0.4) (2.28.2)
Requirement already satisfied: mechanicalsoup in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from cazy_webscraper->cazomevolve==0.0.4) (1.2.0)
Requirement already satisfied: greenlet!=0.4.17 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from sqlalchemy->cazomevolve==0.0.4) (2.0.2)
Requirement already satisfied: tzdata>=2022.1 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from pandas->cazomevolve==0.0.4) (2023.3)
Requirement already satisfied: python-dateutil>=2.8.2 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from pandas->cazomevolve==0.0.4) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from pandas->cazomevolve==0.0.4) (2023.3)
Requirement already satisfied: nbconvert in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from jupyter->cazomevolve==0.0.4) (6.5.4)
Requirement already satisfied: ipywidgets in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from jupyter->cazomevolve==0.0.4) (8.0.4)
Requirement already satisfied: notebook in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from jupyter->cazomevolve==0.0.4) (6.5.3)
Requirement already satisfied: jupyter-console in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from jupyter->cazomevolve==0.0.4) (6.4.3)
Requirement already satisfied: ipykernel in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from jupyter->cazomevolve==0.0.4) (5.5.5)
Requirement already satisfied: qtconsole in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from jupyter->cazomevolve==0.0.4) (5.4.0)
Requirement already satisfied: joblib>=1.1.1 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from scikit-learn->cazomevolve==0.0.4) (1.2.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from scikit-learn->cazomevolve==0.0.4) (3.1.0)
Requirement already satisfied: grequests in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from bioservices>=1.10.4->cazy_webscraper->cazomevolve==0.0.4) (0.6.0)
Requirement already satisfied: appdirs in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from bioservices>=1.10.4->cazy_webscraper->cazomevolve==0.0.4) (1.4.4)
Requirement already satisfied: suds-community>=0.7 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from bioservices>=1.10.4->cazy_webscraper->cazomevolve==0.0.4) (1.1.2)
Requirement already satisfied: wrapt in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from bioservices>=1.10.4->cazy_webscraper->cazomevolve==0.0.4) (1.15.0)
Requirement already satisfied: beautifulsoup4 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from bioservices>=1.10.4->cazy_webscraper->cazomevolve==0.0.4) (4.12.2)
Requirement already satisfied: colorlog in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from bioservices>=1.10.4->cazy_webscraper->cazomevolve==0.0.4) (6.7.0)
Requirement already satisfied: xmltodict in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from bioservices>=1.10.4->cazy_webscraper->cazomevolve==0.0.4) (0.13.0)
Requirement already satisfied: requests-cache in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from bioservices>=1.10.4->cazy_webscraper->cazomevolve==0.0.4) (1.0.1)
Requirement already satisfied: easydev>=0.12.0 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from bioservices>=1.10.4->cazy_webscraper->cazomevolve==0.0.4) (0.12.1)
Requirement already satisfied: lxml in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from bioservices>=1.10.4->cazy_webscraper->cazomevolve==0.0.4) (4.9.2)
Requirement already satisfied: kiwisolver>=1.0.1 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from matplotlib->adjustText->cazomevolve==0.0.4) (1.4.4)
Requirement already satisfied: fonttools>=4.22.0 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from matplotlib->adjustText->cazomevolve==0.0.4) (4.39.3)
Requirement already satisfied: pillow>=6.2.0 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from matplotlib->adjustText->cazomevolve==0.0.4) (9.5.0)
Requirement already satisfied: importlib-resources>=3.2.0 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from matplotlib->adjustText->cazomevolve==0.0.4) (5.12.0)
Requirement already satisfied: contourpy>=1.0.1 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from matplotlib->adjustText->cazomevolve==0.0.4) (1.0.7)
Requirement already satisfied: pyparsing>=2.3.1 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from matplotlib->adjustText->cazomevolve==0.0.4) (3.0.9)
Requirement already satisfied: packaging>=20.0 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from matplotlib->adjustText->cazomevolve==0.0.4) (23.0)
Requirement already satisfied: cycler>=0.10 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from matplotlib->adjustText->cazomevolve==0.0.4) (0.11.0)
Requirement already satisfied: six>=1.5 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from python-dateutil>=2.8.2->pandas->cazomevolve==0.0.4) (1.16.0)
Requirement already satisfied: traitlets>=4.1.0 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from ipykernel->jupyter->cazomevolve==0.0.4) (5.9.0)
Requirement already satisfied: jupyter-client in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from ipykernel->jupyter->cazomevolve==0.0.4) (7.0.6)
Requirement already satisfied: tornado>=4.2 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from ipykernel->jupyter->cazomevolve==0.0.4) (6.1)
Requirement already satisfied: ipython>=5.0.0 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from ipykernel->jupyter->cazomevolve==0.0.4) (8.12.0)
Requirement already satisfied: jupyterlab-widgets~=3.0 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from ipywidgets->jupyter->cazomevolve==0.0.4) (3.0.5)
Requirement already satisfied: widgetsnbextension~=4.0 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from ipywidgets->jupyter->cazomevolve==0.0.4) (4.0.5)
Requirement already satisfied: pygments in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from jupyter-console->jupyter->cazomevolve==0.0.4) (2.15.0)
Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from jupyter-console->jupyter->cazomevolve==0.0.4) (3.0.38)
Requirement already satisfied: certifi>=2017.4.17 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from requests->cazy_webscraper->cazomevolve==0.0.4) (2022.12.7)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from requests->cazy_webscraper->cazomevolve==0.0.4) (1.26.15)
Requirement already satisfied: charset-normalizer<4,>=2 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from requests->cazy_webscraper->cazomevolve==0.0.4) (3.1.0)
Requirement already satisfied: idna<4,>=2.5 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from requests->cazy_webscraper->cazomevolve==0.0.4) (3.4)
Requirement already satisfied: entrypoints>=0.2.2 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from nbconvert->jupyter->cazomevolve==0.0.4) (0.4)
Requirement already satisfied: jinja2>=3.0 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from nbconvert->jupyter->cazomevolve==0.0.4) (3.1.2)
Requirement already satisfied: jupyterlab-pygments in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from nbconvert->jupyter->cazomevolve==0.0.4) (0.2.2)
Requirement already satisfied: mistune<2,>=0.8.1 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from nbconvert->jupyter->cazomevolve==0.0.4) (0.8.4)
Requirement already satisfied: nbformat>=5.1 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from nbconvert->jupyter->cazomevolve==0.0.4) (5.8.0)
Requirement already satisfied: jupyter-core>=4.7 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from nbconvert->jupyter->cazomevolve==0.0.4) (5.3.0)
Requirement already satisfied: nbclient>=0.5.0 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from nbconvert->jupyter->cazomevolve==0.0.4) (0.7.3)
Requirement already satisfied: MarkupSafe>=2.0 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from nbconvert->jupyter->cazomevolve==0.0.4) (2.1.1)
Requirement already satisfied: defusedxml in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from nbconvert->jupyter->cazomevolve==0.0.4) (0.7.1)
Requirement already satisfied: bleach in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from nbconvert->jupyter->cazomevolve==0.0.4) (6.0.0)
Requirement already satisfied: pandocfilters>=1.4.1 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from nbconvert->jupyter->cazomevolve==0.0.4) (1.5.0)
Requirement already satisfied: tinycss2 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from nbconvert->jupyter->cazomevolve==0.0.4) (1.2.1)
Requirement already satisfied: terminado>=0.8.3 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from notebook->jupyter->cazomevolve==0.0.4) (0.17.1)
Requirement already satisfied: pyzmq>=17 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from notebook->jupyter->cazomevolve==0.0.4) (19.0.2)
Requirement already satisfied: nest-asyncio>=1.5 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from notebook->jupyter->cazomevolve==0.0.4) (1.5.6)
Requirement already satisfied: nbclassic>=0.4.7 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from notebook->jupyter->cazomevolve==0.0.4) (0.5.4)
Requirement already satisfied: Send2Trash>=1.8.0 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from notebook->jupyter->cazomevolve==0.0.4) (1.8.0)
Requirement already satisfied: prometheus-client in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from notebook->jupyter->cazomevolve==0.0.4) (0.16.0)
Requirement already satisfied: argon2-cffi in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from notebook->jupyter->cazomevolve==0.0.4) (21.3.0)
Requirement already satisfied: ipython-genutils in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from notebook->jupyter->cazomevolve==0.0.4) (0.2.0)
Requirement already satisfied: qtpy>=2.0.1 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from qtconsole->jupyter->cazomevolve==0.0.4) (2.2.0)
Requirement already satisfied: soupsieve>1.2 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from beautifulsoup4->bioservices>=1.10.4->cazy_webscraper->cazomevolve==0.0.4) (2.3.2.post1)
Requirement already satisfied: colorama in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from easydev>=0.12.0->bioservices>=1.10.4->cazy_webscraper->cazomevolve==0.0.4) (0.4.6)
Requirement already satisfied: pexpect in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from easydev>=0.12.0->bioservices>=1.10.4->cazy_webscraper->cazomevolve==0.0.4) (4.8.0)
Requirement already satisfied: zipp>=3.1.0 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from importlib-resources>=3.2.0->matplotlib->adjustText->cazomevolve==0.0.4) (3.15.0)
Requirement already satisfied: typing-extensions in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from ipython>=5.0.0->ipykernel->jupyter->cazomevolve==0.0.4) (4.5.0)
Requirement already satisfied: backcall in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from ipython>=5.0.0->ipykernel->jupyter->cazomevolve==0.0.4) (0.2.0)
Requirement already satisfied: jedi>=0.16 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from ipython>=5.0.0->ipykernel->jupyter->cazomevolve==0.0.4) (0.18.2)
Requirement already satisfied: decorator in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from ipython>=5.0.0->ipykernel->jupyter->cazomevolve==0.0.4) (5.1.1)
Requirement already satisfied: stack-data in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from ipython>=5.0.0->ipykernel->jupyter->cazomevolve==0.0.4) (0.6.2)
Requirement already satisfied: matplotlib-inline in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from ipython>=5.0.0->ipykernel->jupyter->cazomevolve==0.0.4) (0.1.6)
Requirement already satisfied: pickleshare in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from ipython>=5.0.0->ipykernel->jupyter->cazomevolve==0.0.4) (0.7.5)
Requirement already satisfied: platformdirs>=2.5 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from jupyter-core>=4.7->nbconvert->jupyter->cazomevolve==0.0.4) (3.2.0)
Requirement already satisfied: notebook-shim>=0.1.0 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from nbclassic>=0.4.7->notebook->jupyter->cazomevolve==0.0.4) (0.2.2)
Requirement already satisfied: jupyter-server>=1.8 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from nbclassic>=0.4.7->notebook->jupyter->cazomevolve==0.0.4) (1.23.4)
Requirement already satisfied: jsonschema>=2.6 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from nbformat>=5.1->nbconvert->jupyter->cazomevolve==0.0.4) (4.17.3)
Requirement already satisfied: fastjsonschema in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from nbformat>=5.1->nbconvert->jupyter->cazomevolve==0.0.4) (2.16.3)
Requirement already satisfied: wcwidth in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->jupyter-console->jupyter->cazomevolve==0.0.4) (0.2.6)
Requirement already satisfied: ptyprocess in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from terminado>=0.8.3->notebook->jupyter->cazomevolve==0.0.4) (0.7.0)
Requirement already satisfied: argon2-cffi-bindings in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from argon2-cffi->notebook->jupyter->cazomevolve==0.0.4) (21.2.0)
Requirement already satisfied: webencodings in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from bleach->nbconvert->jupyter->cazomevolve==0.0.4) (0.5.1)
Requirement already satisfied: gevent in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from grequests->bioservices>=1.10.4->cazy_webscraper->cazomevolve==0.0.4) (22.10.2)
Requirement already satisfied: attrs>=21.2 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from requests-cache->bioservices>=1.10.4->cazy_webscraper->cazomevolve==0.0.4) (22.2.0)
Requirement already satisfied: cattrs>=22.2 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from requests-cache->bioservices>=1.10.4->cazy_webscraper->cazomevolve==0.0.4) (22.2.0)
Requirement already satisfied: url-normalize>=1.4 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from requests-cache->bioservices>=1.10.4->cazy_webscraper->cazomevolve==0.0.4) (1.4.3)
Requirement already satisfied: exceptiongroup in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from cattrs>=22.2->requests-cache->bioservices>=1.10.4->cazy_webscraper->cazomevolve==0.0.4) (1.1.1)
Requirement already satisfied: parso<0.9.0,>=0.8.0 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from jedi>=0.16->ipython>=5.0.0->ipykernel->jupyter->cazomevolve==0.0.4) (0.8.3)
Requirement already satisfied: pyrsistent!=0.17.0,!=0.17.1,!=0.17.2,>=0.14.0 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from jsonschema>=2.6->nbformat>=5.1->nbconvert->jupyter->cazomevolve==0.0.4) (0.18.0)
Requirement already satisfied: websocket-client in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from jupyter-server>=1.8->nbclassic>=0.4.7->notebook->jupyter->cazomevolve==0.0.4) (1.5.1)
Requirement already satisfied: anyio<4,>=3.1.0 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from jupyter-server>=1.8->nbclassic>=0.4.7->notebook->jupyter->cazomevolve==0.0.4) (3.6.2)
Requirement already satisfied: cffi>=1.0.1 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from argon2-cffi-bindings->argon2-cffi->notebook->jupyter->cazomevolve==0.0.4) (1.15.0)
Requirement already satisfied: zope.event in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from gevent->grequests->bioservices>=1.10.4->cazy_webscraper->cazomevolve==0.0.4) (4.6)
Requirement already satisfied: setuptools in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from gevent->grequests->bioservices>=1.10.4->cazy_webscraper->cazomevolve==0.0.4) (65.6.3)
Requirement already satisfied: zope.interface in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from gevent->grequests->bioservices>=1.10.4->cazy_webscraper->cazomevolve==0.0.4) (6.0)
Requirement already satisfied: executing>=1.2.0 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from stack-data->ipython>=5.0.0->ipykernel->jupyter->cazomevolve==0.0.4) (1.2.0)
Requirement already satisfied: pure-eval in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from stack-data->ipython>=5.0.0->ipykernel->jupyter->cazomevolve==0.0.4) (0.2.2)
Requirement already satisfied: asttokens>=2.1.0 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from stack-data->ipython>=5.0.0->ipykernel->jupyter->cazomevolve==0.0.4) (2.2.1)
Requirement already satisfied: sniffio>=1.1 in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from anyio<4,>=3.1.0->jupyter-server>=1.8->nbclassic>=0.4.7->notebook->jupyter->cazomevolve==0.0.4) (1.3.0)
Requirement already satisfied: pycparser in /home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages (from cffi>=1.0.1->argon2-cffi-bindings->argon2-cffi->notebook->jupyter->cazomevolve==0.0.4) (2.21)
Installing collected packages: cazomevolve
  Attempting uninstall: cazomevolve
    Found existing installation: cazomevolve 0.0.4
    Uninstalling cazomevolve-0.0.4:
      Successfully uninstalled cazomevolve-0.0.4
  Running setup.py develop for cazomevolve
Successfully installed cazomevolve-0.0.4
In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statistics
import re

from copy import copy
from matplotlib.patches import Patch
from pathlib import Path
import upsetplot
import adjustText
import upsetplot

from Bio import SeqIO
from saintBioutils.utilities.file_io.get_paths import get_file_paths
from saintBioutils.utilities.file_io import make_output_directory
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from tqdm.notebook import tqdm

%matplotlib inline
In [3]:
# loading and parsing data
from cazomevolve.cazome.explore.parse_data import (
    load_fgp_data,
    load_tax_data,
    add_tax_data_from_tax_df,
    add_tax_column_from_row_index,
)

# functions for exploring the sizes of CAZomes
from cazomevolve.cazome.explore.cazome_sizes import (
    calc_proteome_representation,
    count_items_in_cazome,
    get_proteome_sizes,
    count_cazyme_fam_ratio,
)

# explore the frequency of CAZymes per CAZy class
from cazomevolve.cazome.explore.cazy_classes import calculate_class_sizes

# explore the frequencies of CAZy families and identify the co-cazome
from cazomevolve.cazome.explore.cazy_families import (
    build_fam_freq_df,
    build_row_colours,
    build_family_clustermap,
    identify_core_cazome,
    plot_fam_boxplot,
    build_fam_mean_freq_df,
    get_group_specific_fams,
    build_family_clustermap_multi_legend,
)

# functions to identify and explore CAZy families that are always present together
from cazomevolve.cazome.explore.cooccurring_families import (
    identify_cooccurring_fams_corrM,
    calc_cooccuring_fam_freqs,
    identify_cooccurring_fam_pairs,
    add_to_upsetplot_membership,
    build_upsetplot,
    get_upsetplot_grps,
    add_upsetplot_grp_freqs,
    build_upsetplot_matrix,
)

# functions to perform PCA
from cazomevolve.cazome.explore.pca import (
    perform_pca,
    plot_explained_variance,
    plot_scree,
    plot_pca,
    plot_loadings,
)

Output directory¶

Make the parent output directory. Make subdirectories as and when needed throughout the notebook.

Use the function make_output_directory from saintBioutils. One positional argument is required: the path to the target output directory to be build - this must be a Path object.

Always set force and nodelete to True - this ensures the output directory is created, and if it exists content in the output directory is not deleted.

In [4]:
# make output dir for results, will not delete data if dir already exists
make_output_directory(Path('../results/pectobact'), force=True, nodelete=True)
Output directory ../results/pectobact exists, nodelete is True. Adding output to output directory.

Data¶

CAZy family annotations: The GFP file

Load tab delimited list of cazy families, genomes and protein accessions, by providing the path to the 'gfp file' to load_gfp_data().

Each unique protein-family pair is represented on a separate line. Owing to a protein potentially containing multiple CAZyme domains and thus can be annotated with multiple CAZy families, a single protein can be present on multiple rows in the gfp_df.

In [5]:
fgp_file = "../data/pectobact/cazomes/pecto_fam_genomes_proteins"
fgp_df = load_fgp_data(fgp_file)
fgp_df.head(3)
Out[5]:
Family Genome Protein
0 CBM50 GCA_003382565.3 UEM40323.1
1 GT35 GCA_003382565.3 UEM39157.1
2 GH5 GCA_003382565.3 UEM41238.1
In [6]:
print(f"Total CAZymes: {len(set(fgp_df['Protein']))}")
Total CAZymes: 78132

Taxonomy data:

Load in CSV of tax data from generated by cazevolve_add_taxs, by providing a path to the file to load_tax_data(), and specify which tax ranks (kingdom, phylum, etc.) are included in the CSV file.

In [7]:
help(load_tax_data)
Help on function load_tax_data in module cazomevolve.cazome.explore.parse_data:

load_tax_data(tax_csv_path, kingdom=False, phylum=False, tax_class=False, tax_order=False, tax_family=False, genus=False, species=False)
    Load tax data compiled by cazomevolve into a pandas df
    
    :param tax_csv_path: str/Path to csv file of genome, tax_rank, tax_rank
        e.g. 'Genome', 'Genus', 'Species'
    The remaining params are bool checks for lineage ranks included in the tax data file
    
    Return df of genome, tax_rank, tax_rank,  e.g. 'Genome', 'Genus', 'Species'

In [8]:
tax_csv_path = "../data/pectobact/cazomes/fg_genome_taxs.csv"
tax_df = load_tax_data(tax_csv_path, genus=True, species=True)
tax_df.head(3)
Out[8]:
Genome Genus Species
0 GCA_922021645.1 Pectobacterium versatile
1 GCA_004296685.1 Pectobacterium versatile
2 GCA_018094705.1 Pectobacterium versatile

Compile all data into a single dataframe:

Build dataframe of:

  • CAZy family annotations
  • Genomic accession
  • Taxonomic information - splitting each taxonomy rank (i.e. ranks) into a separate column. E.g.:
    • Genus
    • Species
In [9]:
fgp_df = add_tax_data_from_tax_df(
    fgp_df,
    tax_df,
    genus=True,
    species=True,
)
fgp_df.head(3)
Collecting Genus data: 100%|██████████| 83143/83143 [00:56<00:00, 1484.36it/s]
Collecting Species data: 100%|██████████| 83143/83143 [00:57<00:00, 1436.44it/s]
Out[9]:
Family Genome Protein Genus Species
0 CBM50 GCA_003382565.3 UEM40323.1 Pectobacterium aquaticum
1 GT35 GCA_003382565.3 UEM39157.1 Pectobacterium aquaticum
2 GH5 GCA_003382565.3 UEM41238.1 Pectobacterium aquaticum
In [10]:
print(f"Total CAZymes: {len(set(fgp_df['Protein']))}")
Total CAZymes: 78132

1. CAZome size¶

Calculate the number of CAZymes per genome (defined as the number of unique protein accessions per genome).

In total, calculate:

  • The number of CAZymes per genome
  • The mean number of CAZymes per genome per genus
  • The proportion of the proteome represented by the CAZome
  • The mean proportion of the proteome represented by the CAZome

Use the count_items_in_cazome() function to retrieve the number of CAZymes and the number of CAZy families per genome, and the mean counts per genus.

In [11]:
help(count_items_in_cazome)
Help on function count_items_in_cazome in module cazomevolve.cazome.explore.cazome_sizes:

count_items_in_cazome(gfp_df, item, grp, round_by=None)
    Count the number of unique items per genome and per specificed tax grouping
    
    :param gfp_df: panda df, cols = ['Family', 'Genome', 'Protein', 'tax grp', 'tax grp'...]
    :param item: str, name of column to calculate incidence for, e.g. 'Protein' or 'Family'
    :param grp: str, name of column to group genomes by
    :param round_by: int, number of figures to round mean and sd by. If None do not round
    
    Return
    * dict of {grp: {genome: {'items': {items}, 'numOfItems': int(num of items)}}}
    * df, cols = []

In [12]:
# check all genomes are represented in the fgp_df
f"Examining {len(set(fgp_df['Genome']))} genomes"
Out[12]:
'Examining 717 genomes'
In [13]:
print(f"Examining {len(set(fgp_df['Genus']))} genera:")
for genus in set(fgp_df['Genus']):
    print(f'- {genus}')
Examining 8 genera:
- Acerihabitans
- Dickeya
- Samsonia
- Lonsdalea
- Pectobacterium
- Brenneria
- Musicola
- Affinibrenneria
In [14]:
# Calculate CAZymes per genome
cazome_sizes_dict, cazome_sizes_df = count_items_in_cazome(fgp_df, 'Protein', 'Genus', round_by=2)
cazome_sizes_df
Gathering CAZy families per genome: 100%|██████████| 83143/83143 [00:09<00:00, 9131.09it/s]
Calculating num of Protein per genome and per Genus: 100%|██████████| 8/8 [00:00<00:00, 2674.94it/s]
Out[14]:
Genus MeanProteins SdProteins NumOfGenomes
0 Pectobacterium 112.65 8.02 432
1 Dickeya 111.16 6.60 206
2 Musicola 92.25 2.28 4
3 Brenneria 87.79 7.46 33
4 Lonsdalea 77.15 4.70 39
5 Acerihabitans 106.00 0.00 1
6 Affinibrenneria 108.00 0.00 1
7 Samsonia 81.00 0.00 1
In [15]:
# calculate mean across pectobacteriaceae
pectobact_cazome_sizes = []
for genus in cazome_sizes_dict:
    for genome in cazome_sizes_dict[genus]:
        pectobact_cazome_sizes.append(cazome_sizes_dict[genus][genome]['numOfProteins'])

pd.concat(
    [
        cazome_sizes_df, 
        pd.DataFrame(
            [[
                'Pectobacteriaceae',
                np.mean(pectobact_cazome_sizes),
                np.std(pectobact_cazome_sizes),
                len(set(fgp_df['Genome'])),
            ]], 
            columns=cazome_sizes_df.columns
        ),
    ],
    axis=0,
)
Out[15]:
Genus MeanProteins SdProteins NumOfGenomes
0 Pectobacterium 112.650000 8.020000 432
1 Dickeya 111.160000 6.600000 206
2 Musicola 92.250000 2.280000 4
3 Brenneria 87.790000 7.460000 33
4 Lonsdalea 77.150000 4.700000 39
5 Acerihabitans 106.000000 0.000000 1
6 Affinibrenneria 108.000000 0.000000 1
7 Samsonia 81.000000 0.000000 1
0 Pectobacteriaceae 108.970711 11.958225 717
In [16]:
# Calculate CAZy families per genome
cazome_fam_dict, cazome_fams_df = count_items_in_cazome(fgp_df, 'Family', 'Genus', round_by=2)
cazome_fams_df
Gathering CAZy families per genome: 100%|██████████| 83143/83143 [00:08<00:00, 9343.02it/s]
Calculating num of Family per genome and per Genus: 100%|██████████| 8/8 [00:00<00:00, 3171.20it/s]
Out[16]:
Genus MeanFamilys SdFamilys NumOfGenomes
0 Pectobacterium 62.08 3.46 432
1 Dickeya 59.05 3.04 206
2 Musicola 50.25 0.43 4
3 Brenneria 53.67 3.71 33
4 Lonsdalea 42.62 2.14 39
5 Acerihabitans 48.00 0.00 1
6 Affinibrenneria 48.00 0.00 1
7 Samsonia 49.00 0.00 1
In [17]:
# calculate mean across pectobacteriaceae
pectobact_fam_nums = []
for genus in cazome_fam_dict:
    for genome in cazome_fam_dict[genus]:
        pectobact_fam_nums.append(cazome_fam_dict[genus][genome]['numOfFamilys'])

pd.concat(
    [
        cazome_fams_df, 
        pd.DataFrame(
            [[
                'Pectobacteriaceae',
                np.mean(pectobact_fam_nums),
                np.std(pectobact_fam_nums),
                len(set(fgp_df['Genome'])),
            ]], 
            columns=cazome_fams_df.columns
        ),
    ],
    axis=0,
)
Out[17]:
Genus MeanFamilys SdFamilys NumOfGenomes
0 Pectobacterium 62.080000 3.460000 432
1 Dickeya 59.050000 3.040000 206
2 Musicola 50.250000 0.430000 4
3 Brenneria 53.670000 3.710000 33
4 Lonsdalea 42.620000 2.140000 39
5 Acerihabitans 48.000000 0.000000 1
6 Affinibrenneria 48.000000 0.000000 1
7 Samsonia 49.000000 0.000000 1
0 Pectobacteriaceae 59.638773 5.733681 717

Identify the total number of CAZymes

In [18]:
print(f"The total number of CAZymes is {len(set(fgp_df['Protein']))}")

for genus in set(fgp_df['Genus']):
    genus_df = fgp_df[fgp_df['Genus'] == genus]
    print(f"The total number of {genus} CAZymes is {len(set(genus_df['Protein']))}")
The total number of CAZymes is 78132
The total number of Acerihabitans CAZymes is 106
The total number of Dickeya CAZymes is 22899
The total number of Samsonia CAZymes is 81
The total number of Lonsdalea CAZymes is 3009
The total number of Pectobacterium CAZymes is 48663
The total number of Brenneria CAZymes is 2897
The total number of Musicola CAZymes is 369
The total number of Affinibrenneria CAZymes is 108

Look at the ratio of CAZymes to CAZy families.

In [19]:
cazome_ratio_dict, cazome_ratio_df = count_cazyme_fam_ratio(fgp_df, 'Genus', round_by=2)
cazome_ratio_df
Gathering CAZymes and CAZy families per genome: 100%|██████████| 83143/83143 [00:09<00:00, 9012.11it/s]
Calculating CAZyme/CAZy family ratio: 100%|██████████| 8/8 [00:00<00:00, 3636.15it/s]
Out[19]:
Genus MeanCAZymeToFamRatio SdCAZymeToFamRatio NumOfGenomes
0 Pectobacterium 1.81 0.09 432
1 Dickeya 1.88 0.07 206
2 Musicola 1.84 0.04 4
3 Brenneria 1.64 0.07 33
4 Lonsdalea 1.81 0.07 39
5 Acerihabitans 2.21 0.00 1
6 Affinibrenneria 2.25 0.00 1
7 Samsonia 1.65 0.00 1
In [20]:
pecto_ratios = []
for genus in cazome_sizes_dict:
    for genome in cazome_sizes_dict[genus]:
        ratio = (cazome_sizes_dict[genus][genome]['numOfProteins'] / cazome_fam_dict[genus][genome]['numOfFamilys'])
        pecto_ratios.append(ratio)

pd.concat(
    [
        cazome_ratio_df,
        pd.DataFrame(
            [[
                'Pectobacteriaceae',
                np.mean(pecto_ratios),
                np.std(pecto_ratios),
                len(set(fgp_df['Genome'])),
            ]],
            columns=cazome_ratio_df.columns
        )
    ],
    axis=0,
)
Out[20]:
Genus MeanCAZymeToFamRatio SdCAZymeToFamRatio NumOfGenomes
0 Pectobacterium 1.810000 0.09000 432
1 Dickeya 1.880000 0.07000 206
2 Musicola 1.840000 0.04000 4
3 Brenneria 1.640000 0.07000 33
4 Lonsdalea 1.810000 0.07000 39
5 Acerihabitans 2.210000 0.00000 1
6 Affinibrenneria 2.250000 0.00000 1
7 Samsonia 1.650000 0.00000 1
0 Pectobacteriaceae 1.826642 0.09796 717

Proteome sizes:

In [21]:
# Get the size of the proteome (the number of protein acc) per genome
grp = 'Genus'
proteome_dir = "../data/pectobact/proteomes"
proteome_dict = get_proteome_sizes(proteome_dir, fgp_df, grp)
Getting proteome sizes: 100%|██████████| 717/717 [00:41<00:00, 17.09it/s]
In [22]:
# get total number of proteins across all proteomes
total_proteins = 0
for genus in proteome_dict:
    for genome in proteome_dict[genus]:
        total_proteins += proteome_dict[genus][genome]['numOfProteins']
print(f"Total number of proteins across all genomes: {total_proteins}")
Total number of proteins across all genomes: 2994018
In [23]:
# Calculate the mean proteome size by genus and the proportion of the proteome represented by the CAZome
proteome_perc_df = calc_proteome_representation(proteome_dict, cazome_sizes_dict, grp, round_by=2)
proteome_perc_df
Getting proteome size: 100%|██████████| 8/8 [00:00<00:00, 7443.31it/s]
Calc proteome perc: 100%|██████████| 8/8 [00:00<00:00, 2392.81it/s]
Out[23]:
Genus MeanProteomeSize SdProteomeSize MeanProteomePerc SdProteomePerc NumOfGenomes
0 Pectobacterium 4260.71 216.78 2.64 0.15 432
1 Dickeya 4176.86 155.23 2.66 0.11 206
2 Musicola 3992.00 55.76 2.31 0.05 4
3 Brenneria 4270.24 478.85 2.07 0.15 33
4 Lonsdalea 3142.28 132.55 2.45 0.09 39
5 Acerihabitans 4969.00 0.00 2.13 0.00 1
6 Affinibrenneria 5064.00 0.00 2.13 0.00 1
7 Samsonia 3489.00 0.00 2.32 0.00 1
In [24]:
pectobact_average = ['Pectobacteriaceae']
for col in proteome_perc_df.columns[1:]:
    pectobact_average.append(np.mean(list(proteome_perc_df[col])))
pectobact_average[-1] == 660

df = pd.DataFrame([pectobact_average], columns=proteome_perc_df.columns)
pd.concat([proteome_perc_df, df], ignore_index=True, axis=0).round(2)
Out[24]:
Genus MeanProteomeSize SdProteomeSize MeanProteomePerc SdProteomePerc NumOfGenomes
0 Pectobacterium 4260.71 216.78 2.64 0.15 432.00
1 Dickeya 4176.86 155.23 2.66 0.11 206.00
2 Musicola 3992.00 55.76 2.31 0.05 4.00
3 Brenneria 4270.24 478.85 2.07 0.15 33.00
4 Lonsdalea 3142.28 132.55 2.45 0.09 39.00
5 Acerihabitans 4969.00 0.00 2.13 0.00 1.00
6 Affinibrenneria 5064.00 0.00 2.13 0.00 1.00
7 Samsonia 3489.00 0.00 2.32 0.00 1.00
8 Pectobacteriaceae 4170.51 129.90 2.34 0.07 89.62

For easier comparison and presentation, combine the dataframes made above into a single dataframe, with each row representing a different genus.

In [25]:
all_df = pd.concat([proteome_perc_df, cazome_sizes_df, cazome_fams_df, cazome_ratio_df], axis=1, join='inner')
make_output_directory(Path('../results/pectobact/cazome_size'), force=True, nodelete=True)
all_df.to_csv('../results/pectobact/cazome_size/cazome_sizes.csv')
all_df
Output directory ../results/pectobact/cazome_size exists, nodelete is True. Adding output to output directory.
Out[25]:
Genus MeanProteomeSize SdProteomeSize MeanProteomePerc SdProteomePerc NumOfGenomes Genus MeanProteins SdProteins NumOfGenomes Genus MeanFamilys SdFamilys NumOfGenomes Genus MeanCAZymeToFamRatio SdCAZymeToFamRatio NumOfGenomes
0 Pectobacterium 4260.71 216.78 2.64 0.15 432 Pectobacterium 112.65 8.02 432 Pectobacterium 62.08 3.46 432 Pectobacterium 1.81 0.09 432
1 Dickeya 4176.86 155.23 2.66 0.11 206 Dickeya 111.16 6.60 206 Dickeya 59.05 3.04 206 Dickeya 1.88 0.07 206
2 Musicola 3992.00 55.76 2.31 0.05 4 Musicola 92.25 2.28 4 Musicola 50.25 0.43 4 Musicola 1.84 0.04 4
3 Brenneria 4270.24 478.85 2.07 0.15 33 Brenneria 87.79 7.46 33 Brenneria 53.67 3.71 33 Brenneria 1.64 0.07 33
4 Lonsdalea 3142.28 132.55 2.45 0.09 39 Lonsdalea 77.15 4.70 39 Lonsdalea 42.62 2.14 39 Lonsdalea 1.81 0.07 39
5 Acerihabitans 4969.00 0.00 2.13 0.00 1 Acerihabitans 106.00 0.00 1 Acerihabitans 48.00 0.00 1 Acerihabitans 2.21 0.00 1
6 Affinibrenneria 5064.00 0.00 2.13 0.00 1 Affinibrenneria 108.00 0.00 1 Affinibrenneria 48.00 0.00 1 Affinibrenneria 2.25 0.00 1
7 Samsonia 3489.00 0.00 2.32 0.00 1 Samsonia 81.00 0.00 1 Samsonia 49.00 0.00 1 Samsonia 1.65 0.00 1
In [26]:
# calculate means for Pectobacteriaceae
for col in all_df:
    if col == 'Genus' or col == 'NumOfGenomes':
        continue
    print(col, '--', np.mean(list(all_df[col])).round(2))
MeanProteomeSize -- 4170.51
SdProteomeSize -- 129.9
MeanProteomePerc -- 2.34
SdProteomePerc -- 0.07
MeanProteins -- 97.0
SdProteins -- 3.63
MeanFamilys -- 51.58
SdFamilys -- 1.6
MeanCAZymeToFamRatio -- 1.89
SdCAZymeToFamRatio -- 0.04

2. CAZy classes¶

Calculate the number of CAZymes (identified as the number of unique protein accessions) per CAZy class. Also, calculate the mean size of CAZy classes (i.e. the mean number of unique protein accessions per CAZy class in each genome) per genus.

The results are added to a dataframe, which is written to results/pecto_dic/cazy_class_sizes.csv, and was used to generate a proportiona area plot using RawGraphs.

In [27]:
# make output dir for results, will not delete data if dir already exists
make_output_directory(Path('../results/pectobact/cazy_classes/'), force=True, nodelete=True)
Output directory ../results/pectobact/cazy_classes exists, nodelete is True. Adding output to output directory.
In [28]:
class_df, class_size_dict = calculate_class_sizes(fgp_df, 'Genus', round_by=2)
Getting CAZy class sizes: 100%|██████████| 83143/83143 [00:34<00:00, 2431.20it/s]
Calculating CAZy class sizes: 100%|██████████| 6/6 [00:00<00:00, 127.68it/s]
In [29]:
# add values with means across all genera to represent pectobacteriaceae
pectobact_class_means = []

for cazy_class in set(class_df['CAZyClass']):
    df = class_df[class_df['CAZyClass'] == cazy_class]
    new_row = [cazy_class, 'Pectobacteriaceae']
    
    for col in class_df.columns[2:]:
        mean = np.mean(df[col])
        new_row.append(mean)
    
    new_row[-1] = 660
    
    pectobact_class_means.append(new_row)

df = pd.DataFrame(pectobact_class_means, columns = class_df.columns)
all_class_df = pd.concat([class_df, df], axis=0, ignore_index=True)
all_class_df = all_class_df.round(2)
# replace nan with 0
all_class_df = all_class_df.fillna(0)

filtered_class_df = all_class_df[all_class_df['Genus'] != 'Haf']
all_class_df.to_csv('../results/pectobact/cazy_classes/cazy_class_sizes.csv')
all_class_df
Out[29]:
CAZyClass Genus MeanCazyClass SdCazyClass MeanClassPerc SdClassPerc NumOfGenomes
0 GH Acerihabitans 51.00 0.00 48.11 0.00 1
1 GH Dickeya 42.53 3.55 38.24 1.85 206
2 GH Samsonia 34.00 0.00 41.98 0.00 1
3 GH Lonsdalea 30.85 2.28 39.96 1.31 39
4 GH Pectobacterium 50.11 3.91 44.50 1.87 432
5 GH Brenneria 42.48 6.60 48.14 4.19 33
6 GH Musicola 37.00 0.00 40.13 0.99 4
7 GH Affinibrenneria 59.00 0.00 54.63 0.00 1
8 GT Acerihabitans 44.00 0.00 41.51 0.00 1
9 GT Dickeya 37.21 3.12 33.52 2.62 206
10 GT Samsonia 25.00 0.00 30.86 0.00 1
11 GT Lonsdalea 32.46 2.30 42.07 1.24 39
12 GT Pectobacterium 31.76 3.86 28.15 2.23 432
13 GT Brenneria 32.30 2.55 36.90 2.61 33
14 GT Musicola 31.00 3.00 33.55 2.44 4
15 GT Affinibrenneria 35.00 0.00 32.41 0.00 1
16 PL Acerihabitans 1.00 0.00 0.94 0.00 1
17 PL Dickeya 16.29 1.72 14.63 1.19 206
18 PL Samsonia 8.00 0.00 9.88 0.00 1
19 PL Lonsdalea 3.79 0.56 4.92 0.72 39
20 PL Pectobacterium 14.78 1.36 13.14 0.97 432
21 PL Brenneria 4.24 1.23 4.81 1.29 33
22 PL Musicola 11.25 0.43 12.19 0.33 4
23 PL Affinibrenneria 1.00 0.00 0.93 0.00 1
24 CE Acerihabitans 5.00 0.00 4.72 0.00 1
25 CE Dickeya 7.16 0.80 6.44 0.60 206
26 CE Samsonia 8.00 0.00 9.88 0.00 1
27 CE Lonsdalea 3.15 0.48 4.09 0.57 39
28 CE Pectobacterium 7.12 0.83 6.33 0.68 432
29 CE Brenneria 4.30 1.06 4.93 1.24 33
30 CE Musicola 6.00 0.00 6.51 0.16 4
31 CE Affinibrenneria 7.00 0.00 6.48 0.00 1
32 AA Acerihabitans 0.00 0.00 0.00 0.00 1
33 AA Dickeya 1.00 0.00 0.90 0.06 85
34 AA Samsonia 1.00 0.00 1.23 0.00 1
35 AA Lonsdalea 0.00 0.00 0.00 0.00 39
36 AA Pectobacterium 1.03 0.16 0.91 0.17 371
37 AA Brenneria 1.00 0.00 1.27 0.06 8
38 AA Musicola 0.00 0.00 0.00 0.00 4
39 AA Affinibrenneria 0.00 0.00 0.00 0.00 1
40 CBM Acerihabitans 11.00 0.00 10.38 0.00 1
41 CBM Dickeya 12.27 1.25 11.03 0.83 206
42 CBM Samsonia 10.00 0.00 12.35 0.00 1
43 CBM Lonsdalea 8.74 0.54 11.35 0.63 39
44 CBM Pectobacterium 13.98 1.49 12.41 1.02 432
45 CBM Brenneria 9.36 0.64 10.74 1.17 33
46 CBM Musicola 11.00 1.00 11.96 1.38 4
47 CBM Affinibrenneria 10.00 0.00 9.26 0.00 1
48 GT Pectobacteriaceae 33.59 1.85 34.87 1.39 660
49 GH Pectobacteriaceae 43.37 2.04 44.46 1.28 660
50 CBM Pectobacteriaceae 10.79 0.62 11.18 0.63 660
51 AA Pectobacteriaceae 0.50 0.02 0.54 0.04 660
52 CE Pectobacteriaceae 5.97 0.40 6.17 0.41 660
53 PL Pectobacteriaceae 7.54 0.66 7.68 0.56 660

Very few genomes contained any AA CAZymes. Identify the number of genomes were no AA CAZymes were found, additionally, find the maximum, minimum and mode number of AA CAZymes found across all 660 genomes.

In [30]:
# calc genomes with no AAs
no_aa_genomes = 0
for genus in class_size_dict['AA']:
    for genome in class_size_dict['AA'][genus]:
        no_aa_genomes+=1
print(f"{no_aa_genomes} genomes have no AAs")

aa_counts = [0] * no_aa_genomes
for genus in class_size_dict['AA']:
    for genome in class_size_dict['AA'][genus]:
        aa_counts.append(len(class_size_dict['AA'][genus][genome]['proteins']))
print(f"Max: {max(aa_counts)}\nMin: {min(aa_counts)}\nMode: {statistics.mode(aa_counts)}")
465 genomes have no AAs
Max: 2
Min: 0
Mode: 0

Count the number of genomes were 1 or 2 AA CAZymes were found.

In [31]:
# find genomes with 2 AAs
two_aa_genomes = {}
one_aa_genomes = {}

for genus in class_size_dict['AA']:
    for genome in class_size_dict['AA'][genus]:
        if len(class_size_dict['AA'][genus][genome]['proteins']) == 2:
            try:
                two_aa_genomes[genus].add(genome)
            except KeyError:
                two_aa_genomes[genus] = {genome}
                
        elif len(class_size_dict['AA'][genus][genome]['proteins']) == 1:
            try:
                one_aa_genomes[genus].add(genome)
            except KeyError:
                one_aa_genomes[genus] = {genome}

two_aa_genomes
Out[31]:
{'Pectobacterium': {'GCA_000738125.1',
  'GCA_000749915.1',
  'GCA_011378985.1',
  'GCA_011379045.1',
  'GCA_020971565.1',
  'GCA_024343355.1',
  'GCA_024722495.1',
  'GCA_028335745.1',
  'GCA_900195285.2',
  'GCA_900195295.2'}}
In [32]:
for genus in one_aa_genomes:
    print(f"{genus}: {len(one_aa_genomes[genus])}")
# 83.56
# 41.26
# 24.24
# 100
Pectobacterium: 361
Dickeya: 85
Brenneria: 8
Samsonia: 1

3. CAZy families¶

CAZy family frequency dataframe¶

Calculate the number of CAZymes per CAZy family presented in each genome, where the number of CAZymes is the number of unqiue protein accessions. This value may be greater than the number of CAZymes in the genome because a CAZyme may be annotated with multiple CAZy families.

In [33]:
# make output directory
# make output dir for results, will not delete data if dir already exists
make_output_directory(Path('../results/pectobact/cazy_families/'), force=True, nodelete=True)
Output directory ../results/pectobact/cazy_families exists, nodelete is True. Adding output to output directory.
In [34]:
fam_freq_df = build_fam_freq_df(fgp_df, ['Genus', 'Species'])
fam_freq_df
The dataset contains 117 CAZy families
Counting fam frequencies: 100%|██████████| 717/717 [01:01<00:00, 11.74it/s]
Out[34]:
Genome Genus Species AA10 AA3 CBM0 CBM13 CBM18 CBM3 CBM32 ... PL11 PL17 PL2 PL22 PL26 PL3 PL35 PL38 PL4 PL9
0 GCA_009874285.1 Dickeya dianthicola 0 0 0 0 0 0 0 ... 0 0 1 1 1 2 0 0 2 3
1 GCA_002307355.1 Pectobacterium polaris 0 1 0 1 0 1 1 ... 0 0 2 1 1 2 0 1 1 2
2 GCA_016950075.1 Pectobacterium brasiliense 0 1 0 1 0 1 2 ... 0 0 2 1 1 2 0 0 1 2
3 GCA_020295565.1 Pectobacterium versatile 0 1 0 1 0 1 1 ... 0 0 2 1 1 2 0 0 1 2
4 GCA_024343475.1 Pectobacterium carotovorum subsp. carotovorum 0 1 0 1 0 1 1 ... 0 0 2 1 1 2 0 0 1 2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
712 GCA_003628015.1 Pectobacterium parmentieri 0 0 0 1 0 1 1 ... 0 0 2 1 1 1 0 0 1 2
713 GCA_003666235.1 Brenneria goodwinii 0 0 0 0 0 1 0 ... 0 0 1 1 0 1 0 0 0 0
714 GCA_020406975.1 Dickeya solani 0 1 0 0 0 0 0 ... 0 0 1 1 1 1 0 0 1 3
715 GCA_018634035.1 Pectobacterium atrosepticum 0 1 0 1 0 1 1 ... 0 0 2 1 1 2 0 1 1 2
716 GCA_016950155.1 Pectobacterium punjabense 0 1 0 1 0 1 1 ... 0 0 2 1 1 2 0 0 1 2

717 rows × 120 columns

In [35]:
fam_freq_df.to_csv("../results/pectobact/cazy_families/cazy_fam_freqs.csv")

Clustermaps¶

Build clustermap of CAZy family frequencies, with additional row colours marking the genus classification of each genome (i.e. each row).

Prepare the dataframe of CAZy family frequencies:

Index fam_freq_df so that each row name contains the genome, Genus and Species, so that the genomic accession, genus and species is included in the clustermap.

In [36]:
# index the taxonomy data and genome (ggs=genome_genus_species)
fam_freq_df_ggs = copy(fam_freq_df)  # so does not alter fam_freq_df
fam_freq_df_ggs = fam_freq_df_ggs.set_index(['Genome','Genus','Species'])
fam_freq_df_ggs.head(1)
Out[36]:
AA10 AA3 CBM0 CBM13 CBM18 CBM3 CBM32 CBM4 CBM48 CBM5 ... PL11 PL17 PL2 PL22 PL26 PL3 PL35 PL38 PL4 PL9
Genome Genus Species
GCA_009874285.1 Dickeya dianthicola 0 0 0 0 0 0 0 0 2 2 ... 0 0 1 1 1 2 0 0 2 3

1 rows × 117 columns

Colour scheme:

Define a colour scheme to colour code the rows by, in this case by the genus of the species.

To do this, add a column containing the data to be used to colour code each row, e.g. a genus. This extra column is removed by build_row_colours(). The dataframe that is parsed to build_row_colours() must be the dataframe that is used to generate a clustermap, otherwise Seaborn will not be able to map the row oclours correctly and no row colours will be produced.

In [37]:
# define a colour scheme to colour code rows by genus
fam_freq_df_ggs['Genus'] = list(fam_freq_df['Genus'])  # add column to use for colour scheme, is removed
fam_freq_genus_row_colours, fam_g_lut = build_row_colours(fam_freq_df_ggs, 'Genus', 'Set2')

Build a clustermap of CAZy family frequencies:

Use the function build_family_clustermap() from cazomevolve to build clustermaps of the CAZy family frequencies, with different combinations of additional row colours. For example, the row colours could list the genus and/or species classification of each genome.

In [38]:
help(build_family_clustermap)
Help on function build_family_clustermap in module cazomevolve.cazome.explore.cazy_families:

build_family_clustermap(df, row_colours=None, fig_size=None, file_path=None, file_format='png', font_scale=1, dpi=300, dendrogram_ratio=None, lut=None, legend_title='', title_fontsize='2', legend_fontsize='2', bbox_to_anchor=(1, 1), cmap=<matplotlib.colors.ListedColormap object at 0x7fbe1c401880>, cbar_pos=(0.02, 0.8, 0.05, 0.18))
    Build a clustermap of the CAZy family frequencies per genome
    
    :param df: df of CAZy family frequencies per genome
    :param row_colours: pandas map - used to define additional row colours. or list of maps for 
        multiple sets of row colours. If None, additional row colours are not plotted
    :param fig_size: tuple (width, height) of final figure. If None, decided by Seaborn
    :param file_path: path to save image to. If None, the figure is not written to a file
    :param file_format: str, file format to save figure to. Default 'png'
    :param font_scale: int, scale text - use if text is overlapping. <1 to reduce 
        text size
    :param dpi: dpi of saved figure
    :param dendrogram_ratio: Proportion of the figure size devoted to the dendrograms.
        If a pair is given, they correspond to (row, col) ratios.
    :param lut: lut from generating colour scheme, add to include legend in the plot7
    :param legend_title: str, title of legend for row colours
    :title_fontsize: int or {'xx-small', 'x-small', 'small', 'medium', 'large', 'x-large', 'xx-large'}
        The font size of the legend's title.
    :legend_fontsize: int or {'xx-small', 'x-small', 'small', 'medium', 'large', 'x-large', 'xx-large'}
    :param bbox_to_anchor: tuple, coordinates to place legend
    :param cmap: Seaborn cmap to be used for colour scheme of the heat/clustermap
    :param cbar_pos: from seaborn.clustermap, position and size of colour scale key/bar
        seaborn default=(0.02, 0.8, 0.05, 0.18) - left, bottom, width, height
    
    Return clustermap object

In [39]:
# make a figure that is full size, and all data is legible
large_fam_clustermap = build_family_clustermap(
    fam_freq_df_ggs,
    row_colours=fam_freq_genus_row_colours,
    fig_size=(40,120),
    file_path="../results/pectobact/cazy_families/fam_freq_clustermap.svg",
    file_format='svg',
    lut=fam_g_lut,
    legend_title='Genus',
    dendrogram_ratio=(0.2,0.05),
    title_fontsize=28,
    legend_fontsize=24,
    cbar_pos=(0, 0.95, 0.05, 0.05),
)
/home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages/seaborn/matrix.py:560: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance.
  warnings.warn(msg)
/home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages/seaborn/matrix.py:560: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance.
  warnings.warn(msg)
In [40]:
# make a figure the optimal size to fit in a paper
build_family_clustermap(
    fam_freq_df_ggs,
    row_colours=fam_freq_genus_row_colours,
    fig_size=(20,35),
    file_path="../results/pectobact/cazy_families/paper_fam_freq_clustermap.png",
    file_format='png',
    font_scale=0.5,
    lut=fam_g_lut,
    legend_title='Genus',
    dendrogram_ratio=(0.1,0.05),
    title_fontsize=18,
    legend_fontsize=16,
    cbar_pos=(0, 0.95, 0.05, 0.05),
)
/home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages/seaborn/matrix.py:560: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance.
  warnings.warn(msg)
/home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages/seaborn/matrix.py:560: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance.
  warnings.warn(msg)
Out[40]:
<seaborn.matrix.ClusterGrid at 0x7fbe1a79d250>

Add species classifications¶

Looking at the species names in the clustermap, there appears to be clustering of the genomes in a manner that correlates not only with their genus classificaiton but also their species classification. Therefore, add an additional row of row-colours, marking the species classification of each genome.

In [41]:
# define a colour scheme to colour code rows by SPECIES
fam_freq_df_ggs['Species'] = list(fam_freq_df['Species'])  # add column to use for colour scheme, is removed
fam_freq_species_row_colours, fam_s_lut = build_row_colours(fam_freq_df_ggs, 'Species', 'rainbow')
In [42]:
# make a figure the optimal size to fit in a paper
build_family_clustermap_multi_legend(
    df=fam_freq_df_ggs,
    row_colours=[fam_freq_genus_row_colours,fam_freq_species_row_colours],
    luts=[fam_g_lut, fam_s_lut],
    legend_titles=['Genus', 'Species'],
    bbox_to_anchors=[(0.2,1.045), (0.63,1.04)],
    legend_cols=[1,5],
    fig_size=(20,40),
    file_path="../results/pectobact/cazy_families/paper_genus_species_fam_freq_clustermap.png",
    file_format='png',
    font_scale=1,
    dendrogram_ratio=(0.1,0.05),
    title_fontsize=18,
    legend_fontsize=16,
    cbar_pos=(0.01, 0.96, 0.1, 0.1),  #left, bottom, width, height
)
/home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages/seaborn/matrix.py:560: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance.
  warnings.warn(msg)
/home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages/seaborn/matrix.py:560: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance.
  warnings.warn(msg)
Out[42]:
<seaborn.matrix.ClusterGrid at 0x7fbe0d99edc0>

Add phenotype classification to clustermap¶

In [43]:
# define a colour scheme to colour code SOFT vs HARD plant tissue targeting genomes
phenotype_col = []
for ri in range(len(fam_freq_df_ggs)):
    if list(fam_freq_df['Genus'])[ri] in ['Pectobacterium', 'Dickeya', 'Musicola']:
        phenotype_col.append('Soft tissue targeting')
    else:
        phenotype_col.append('Hard tissue targeting')
fam_freq_df_ggs['Phenotype'] = phenotype_col
fam_freq_pheno_row_colours, fam_p_lut = build_row_colours(fam_freq_df_ggs, 'Phenotype', "Set1")
In [44]:
build_family_clustermap_multi_legend(
    df=fam_freq_df_ggs,
    row_colours=[fam_freq_pheno_row_colours, fam_freq_genus_row_colours],
    luts=[fam_p_lut, fam_g_lut],
    legend_titles=['Phenotype', 'Genus'],
    bbox_to_anchors=[(0.225,1.045), (0.63,1.04)],
    legend_cols=[1,5],
    fig_size=(25,41),
    file_path="../results/pectobact/cazy_families/paper_pheno_genus_fam_freq_clustermap.png",
    file_format='png',
    font_scale=0.8,
    dendrogram_ratio=(0.1,0.05),
    title_fontsize=18,
    legend_fontsize=16,
    cbar_pos=(0.01, 0.96, 0.1, 0.1),  #left, bottom, width, height
)
/home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages/seaborn/matrix.py:560: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance.
  warnings.warn(msg)
/home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages/seaborn/matrix.py:560: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance.
  warnings.warn(msg)
Out[44]:
<seaborn.matrix.ClusterGrid at 0x7fbe0cdd5310>

Remove genomes¶

In the clustermaps the genomes GCA_029023745.1 (Pectobacterium colocasium), GCA_000749925.1 and GCA_000749945.1 (Pectobacterium betavasulorum) contained under estimated representations of their respective CAZomes.

Six Pectobacterium genomes were not included within the main Pectobacterium subtree (dendrogram on the RHS of clustermap):

  • GCA_000749925.1 and GCA_000749845.1 (Pectobacterium betavasulorum)
  • GCA_000803215.1 (Pectobacterium fontis)
  • GCA_025946765.1 (Pectobacterium cacticida)
  • GCA_004137815.1 (Pectobacterium zantedeschiae)
  • GCA_029023745.1 (Pectobacterium colocasium)

Extracted from the paper:

The genomes appeared to contain fewer total CAZymes (inferred from the lower CAZy family frequencies) than other Pectobacterium genomes, inferring a potential underestimation of their CAZyme features. Genomes GCA_000749925.1, GCA_000749845.1, and GCA_000803215.1 were were listed with the assembly status 'contig' in NCBI (June 2021). Genomic assemblies with the assembly status of 'contig' may contain incomplete genomic sequences. Indeed, the reported CheckM (Parks et al 2015, Genome Res) analysis listed the GCA_000749925.1 and GCA_000749845.1 as missing 5% (100th percentile) of their genomes with 2.25-2.5% contamination, and GCA_000803215.1 as missing 10% (100th percentile). Furthermore, although listed with the assembly status 'complete genome', assembly GCA_025946765.1 was listed as missing 19% (100th percentile) of its genome by CheckM, and the scaffold GCA_004137815.1 was listed as missing 11% (33rd percentile) with 9% contamination. Therefore, the annotated proteomes potentially underestimates the number of features (including CAZymes) in the genomes, and were excluded from the downstream analyses. The genome GCA_029023745.1 was listed with the assembly status 'complete genome', but the NCBI Prokaryotic Genome Annotation Pipeline (PGAGP) output contained a suspiciously high number of frameshifted proteins (greater than 30%), inferring a potentially poor annotation of the genome that may have resulted in an underestimation of its CAZyme features. Therefore, this genome was also excluded from downstream analyses.

In [45]:
genomes_to_remove = [
    'GCA_000749925.1',
    'GCA_000749845.1',
    'GCA_000803215.1',
    'GCA_025946765.1',
    'GCA_004137815.1',
    'GCA_029023745.1',
]
fam_freq_filtered_df = fam_freq_df[~fam_freq_df['Genome'].isin(genomes_to_remove)]
print(f"Original df length: {len(fam_freq_df)}\nLength after removing genome: {len(fam_freq_filtered_df)}")
Original df length: 717
Length after removing genome: 711

Replot the clustermap¶

Replot the clustermap, exlucding the removed genomes.

In [46]:
fam_freq_filtered_df_ggs = fam_freq_filtered_df.set_index(['Genome', 'Genus', 'Species'])
In [47]:
# make a figure the optimal size to fit in a paper
build_family_clustermap(
    fam_freq_df_ggs,
    row_colours=fam_freq_genus_row_colours,
    fig_size=(20,70),
    file_path="../results/pectobact/cazy_families/paper_fam_freq_clustermap_FILTERED.svg",
    file_format='svg',
    font_scale=0.5,
    lut=fam_g_lut,
    legend_title='Genus',
    dendrogram_ratio=(0.1,0.05),
    title_fontsize=18,
    legend_fontsize=16,
    cbar_pos=(0, 0.95, 0.05, 0.05),
)
/home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages/seaborn/matrix.py:560: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance.
  warnings.warn(msg)
/home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages/seaborn/matrix.py:560: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance.
  warnings.warn(msg)
Out[47]:
<seaborn.matrix.ClusterGrid at 0x7fbe041fc880>

Genus specific CAZy families¶

Identify CAZy families that are only present in one group, e.g. one Genus, using the function get_group_specific_fams from cazomevolve.

Specifically, get_group_specific_fams returns two dicts:

  1. Group specific families: {group: {only unique fams}}
  2. All families per group: {group: {all fams}}
In [48]:
all_families = list(fam_freq_df.columns)[3:]
# dict {group: {only unique fams}} and dict {group: {all fams}}
unique_grp_fams, group_fams = get_group_specific_fams(fam_freq_filtered_df, 'Genus', all_families)
unique_grp_fams
Identifying fams in each Genus: 100%|██████████| 711/711 [00:15<00:00, 44.44it/s]
Identifying Genus specific fams: 100%|██████████| 8/8 [00:00<00:00, 20984.64it/s]
Out[48]:
{'Dickeya': {'CBM4', 'CE2', 'GH113', 'GH148', 'GH25', 'GH91', 'GT97', 'PL10'},
 'Pectobacterium': {'AA10',
  'CBM13',
  'GH121',
  'GH146',
  'GH18',
  'GT101',
  'GT102',
  'GT11',
  'GT111',
  'GT14',
  'GT24',
  'GT52',
  'PL11'},
 'Brenneria': {'GH106', 'GT21', 'PL17'},
 'Acerihabitans': {'GH127', 'GH15'}}

Identify families that are only found in hard plant tissue targeting genomes, and those families only found in soft plant tissue targeting species.

In [49]:
hard_soft_fams_dict = {'hard': set(), 'soft': set()}

for ri in tqdm(range(len(fam_freq_filtered_df)), desc="Identifying Soft and Hard plant tissue targeting families"):
    genus = fam_freq_filtered_df.iloc[ri]['Genus']
    
    if genus in ['Pectobacterium','Dickeya','Musicola']:
        grp = 'soft'
    else:
        grp = 'hard'
    
    for fam in fam_freq_filtered_df.columns[3:]:
        if fam_freq_filtered_df.iloc[ri][fam] > 1:
            hard_soft_fams_dict[grp].add(fam)

unique_hard_fams = hard_soft_fams_dict['hard'].difference(hard_soft_fams_dict['soft'])
unique_soft_fams = hard_soft_fams_dict['soft'].difference(hard_soft_fams_dict['hard'])

print("Hard plant tissue targeting specific families:")
for fam in unique_hard_fams:
    print(fam)
print("Soft plant tissue targeting specific families:")
for fam in unique_soft_fams:
    print(fam)
Identifying Soft and Hard plant tissue targeting families:   0%|          | 0/711 [00:00<?, ?it/s]
Hard plant tissue targeting specific families:
GH77
CE11
PL22
GH37
GT20
GH31
GT28
CE1
Soft plant tissue targeting specific families:
AA3
GT30
GH38
CBM32
PL26
GT97
CBM91
GT1
GH94
GT41
PL2
GH2
CBM13
GH5
GT32
GH18
GT56
GT19
GH103
CBM0
GH53
GT84
GH102
PL4
GH30
PL9
CBM63

Build into a df that will be similar to one presented in a paper/report.

In [50]:
# convert into df
unique_grp_data = []

unique_grp_fams['Hard tissue'] = unique_hard_fams
unique_grp_fams['Soft tissue'] = unique_soft_fams

for genus in unique_grp_fams:
    new_data = [genus]
    for cazy_class in ['GH', 'GT', 'CE', 'PL', 'AA', 'CBM']:
        added = False
        class_data = []
        for fam in unique_grp_fams[genus]:
            if fam.startswith(cazy_class):
                class_data.append(fam)
                added = True
        if added is False:
            class_data.append("")
        class_data.sort()
        new_data.append(", ".join(class_data))
            
    unique_grp_data.append(new_data)
    
unique_grp_df = pd.DataFrame(unique_grp_data, columns=['Genus', 'GH', 'GT', 'CE', 'PL', 'AA', 'CBM'])
unique_grp_df.to_csv("../results/pectobact/core_cazome/unique_grp_fams.tsv", sep='\t')
unique_grp_df
Out[50]:
Genus GH GT CE PL AA CBM
0 Dickeya GH113, GH148, GH25, GH91 GT97 CE2 PL10 CBM4
1 Pectobacterium GH121, GH146, GH18 GT101, GT102, GT11, GT111, GT14, GT24, GT52 PL11 AA10 CBM13
2 Brenneria GH106 GT21 PL17
3 Acerihabitans GH127, GH15
4 Hard tissue GH31, GH37, GH77 GT20, GT28 CE1, CE11 PL22
5 Soft tissue GH102, GH103, GH18, GH2, GH30, GH38, GH5, GH53... GT1, GT19, GT30, GT32, GT41, GT56, GT84, GT97 PL2, PL26, PL4, PL9 AA3 CBM0, CBM13, CBM32, CBM63, CBM91

Pectobacterium and Dickeya genus-specific families¶

In order to compare between this GenBank dataset and the RefSeq data set in explore_pecto_dic_cazomes.ipynb, drop all genomes not from Pectobacterium and Dickeya from the fam_freq_df dataframe, and repeat the analysis to identify genus specific CAZy families.

In [51]:
all_families = list(fam_freq_filtered_df.columns)[3:]
pd_fam_freq_df_filtered = fam_freq_filtered_df[fam_freq_filtered_df['Genus'].isin(
    ['Pectobacterium', 'Dickeya']
)]
# dict {group: {only unique fams}} and dict {group: {all fams}}
pd_unique_grp_fams, pd_group_fams = get_group_specific_fams(pd_fam_freq_df_filtered, 'Genus', all_families)
pd_unique_grp_fams
Identifying fams in each Genus: 100%|██████████| 632/632 [00:14<00:00, 43.69it/s]
Identifying Genus specific fams: 100%|██████████| 2/2 [00:00<00:00, 15060.34it/s]
Out[51]:
{'Dickeya': {'CBM4',
  'CE2',
  'GH113',
  'GH148',
  'GH25',
  'GH26',
  'GH91',
  'GT97',
  'PL0',
  'PL10'},
 'Pectobacterium': {'AA10',
  'CBM13',
  'CBM18',
  'CBM3',
  'CBM67',
  'GH108',
  'GH12',
  'GH121',
  'GH146',
  'GH153',
  'GH154',
  'GH18',
  'GH38',
  'GH65',
  'GH68',
  'GT101',
  'GT102',
  'GT11',
  'GT111',
  'GT14',
  'GT20',
  'GT24',
  'GT32',
  'GT52',
  'GT73',
  'PL11'}}

4. The Core CAZome¶

Identify CAzy families that are present in every genome in the dataset using identify_core_cazome(), which takes the dataframe of CAZy family frequencies (with only CAZy families included in the columns, i.e no taxonomy columns). These families form the 'core CAZome'.

In [52]:
# make output directory
# make output dir for results, will not delete data if dir already exists
make_output_directory(Path('../results/pectobact/core_cazome/'), force=True, nodelete=True)
Output directory ../results/pectobact/core_cazome exists, nodelete is True. Adding output to output directory.
In [53]:
fam_freq_filtered_df_ggs = fam_freq_filtered_df.set_index(['Genome', 'Genus', 'Species'])
In [54]:
core_cazome = identify_core_cazome(fam_freq_filtered_df_ggs)

core_cazome = list(core_cazome)
core_cazome.sort()

print(f"Total families: {len(all_families)}")
print("The core CAZy families are:")
for fam in core_cazome:
    print('-', fam)
Identifying core CAZome: 100%|██████████| 117/117 [00:00<00:00, 8231.44it/s]
Total families: 117
The core CAZy families are:
- CBM5
- CBM50
- GH23
- GH3
- GT2
- GT51
- GT9

The boxplot shows the frequency of each CAZy family across all genomes in the dataframe. We can also break down this data by genus, and build a dataframe of Family, Genus (or tax rank of choice), genome, and frequency.

This dataframe can then be used to build a second dataframe of:

  • Family
  • Tax rank
  • Mean frequency
  • SD frequency Which can be presented as is in a report, or imported into RawGraphs to build a matrix plot (aka a proporitonal area plot).
In [55]:
# filter the famil freq df to include only those families in the core CAZome
core_cazome_df = fam_freq_filtered_df_ggs[core_cazome]
plot_fam_boxplot(core_cazome_df, font_scale=0.8, fig_size=(12,6))

The boxplot shows the frequency of each core CAZy family across all Pectobacteriaceae. To break down the frequency by genus, build a dataframe with the mean (and SD) of frequency of each family in the core CAZome per genus. This dataframe can then be used to plot a proportional area plot of the mean frequency of each CAZy family per genus, for exampling using RawGraphs.

In [56]:
help(build_fam_mean_freq_df)
Help on function build_fam_mean_freq_df in module cazomevolve.cazome.explore.cazy_families:

build_fam_mean_freq_df(df, grp, round_by=None)
    Build two dataframes of fam frequencies from a wide fam freq df
    
    DF 1: Family, tax rank (i.e. group), genome, freq
    DF 2: Family, tax rank (i.e. group), mean freq, sd freq
    
    :param df: pandas df, each row is a genome and each column a CAZy family
        and one column with tax rank listed (e.g. a 'Genus' column)
        and index includes the genomic accession
    :param grp: str, name of tax rank to group data by, and matches a name of one 
        of the columns in the dataframe (e.g. a 'Genus' column)
    :param round_by: int, number of decimal points to round by. If None, does not round
    
    Return two dataframes

In [57]:
core_cazome_df_genus = copy(core_cazome_df)  # to ensure core_cazome_df is not altereted
core_cazome_df_genus = add_tax_column_from_row_index(core_cazome_df_genus, 'Genus', 1)
core_cazome_df_genus.head()
Out[57]:
CBM5 CBM50 GH23 GH3 GT2 GT51 GT9 Genus
Genome Genus Species
GCA_009874285.1 Dickeya dianthicola 2 6 7 4 9 3 4 Dickeya
GCA_002307355.1 Pectobacterium polaris 1 6 4 2 9 3 4 Pectobacterium
GCA_016950075.1 Pectobacterium brasiliense 1 6 5 2 9 3 4 Pectobacterium
GCA_020295565.1 Pectobacterium versatile 1 6 7 2 7 3 3 Pectobacterium
GCA_024343475.1 Pectobacterium carotovorum subsp. carotovorum 1 6 6 2 4 3 4 Pectobacterium
In [58]:
core_cazome_fggf_df, core_cazome_mean_freq_df = build_fam_mean_freq_df(
    core_cazome_df_genus,
    'Genus',
    round_by=2,
)

# add rows showing the means across all pectobacteriaceae
all_pecto_core_fam_data = []
for fam in core_cazome_df_genus.columns:
    try:
        mean_freq = np.mean(core_cazome_df_genus[fam]).round(2)
        sd_freq = np.std(core_cazome_df_genus[fam]).round(2)
        all_pecto_core_fam_data.append([fam, 'Pectobacteriaceae', mean_freq, sd_freq])
    except TypeError: # tax column
        continue
    
temp_df = pd.DataFrame(all_pecto_core_fam_data, columns=['Family','Genus','MeanFreq','SdFreq'])
core_cazome_mean_freq_df = pd.concat([core_cazome_mean_freq_df, temp_df])

core_cazome_mean_freq_df.to_csv("../results/pectobact/core_cazome/core_cazome_freqs.csv")

core_cazome_mean_freq_df
Building [fam, grp, genome, freq] df: 100%|██████████| 711/711 [00:00<00:00, 4594.39it/s]
Building [Fam, grp, mean freq, sd freq] df: 100%|██████████| 8/8 [00:00<00:00, 118.12it/s]
Out[58]:
Family Genus MeanFreq SdFreq
0 CBM5 Acerihabitans 1.00 0.00
1 CBM50 Acerihabitans 6.00 0.00
2 GH23 Acerihabitans 8.00 0.00
3 GH3 Acerihabitans 2.00 0.00
4 GT2 Acerihabitans 12.00 0.00
... ... ... ... ...
2 GH23 Pectobacteriaceae 6.49 1.55
3 GH3 Pectobacteriaceae 2.49 0.60
4 GT2 Pectobacteriaceae 8.15 2.10
5 GT51 Pectobacteriaceae 3.08 0.37
6 GT9 Pectobacteriaceae 3.70 0.56

63 rows × 4 columns

Genus specific core CAZomes¶

As well as look at the core CAZome across all Pectobacteriaceae, identify the core CAZome of each genus. Generate a upsetplot to highlight the differences between the core CAZomes.

Note: Only looking at those genera that are represented by more than one genome, so that a core CAZome can be found. Otherwise, for genera with only one genome representative, all families in that one genome will be listed in the core CAZome.

In [59]:
genera_of_interest = ['Pectobacterium', 'Dickeya', 'Musicola', 'Brenneria', 'Lonsdalea']
all_families = fam_freq_filtered_df_ggs.columns

core_cazomes = {}  # {genus: {fams}}
for genus in genera_of_interest:
    filtered_df = fam_freq_filtered_df[fam_freq_filtered_df['Genus'] == genus]
    temp_core_cazome = identify_core_cazome(filtered_df[all_families])
    temp_core_cazome = list(temp_core_cazome)
    temp_core_cazome.sort()
    core_cazomes[genus] = {'fams': temp_core_cazome, 'freqs': {len(filtered_df)}}
    
core_cazomes
Identifying core CAZome: 100%|██████████| 117/117 [00:00<00:00, 9986.03it/s]
Identifying core CAZome: 100%|██████████| 117/117 [00:00<00:00, 13213.43it/s]
Identifying core CAZome: 100%|██████████| 117/117 [00:00<00:00, 16831.88it/s]
Identifying core CAZome: 100%|██████████| 117/117 [00:00<00:00, 16350.70it/s]
Identifying core CAZome: 100%|██████████| 117/117 [00:00<00:00, 16225.82it/s]
Out[59]:
{'Pectobacterium': {'fams': ['CBM5',
   'CBM50',
   'GH1',
   'GH103',
   'GH23',
   'GH28',
   'GH3',
   'GH43',
   'GT2',
   'GT51',
   'GT9',
   'PL1',
   'PL2',
   'PL22',
   'PL3',
   'PL9'],
  'freqs': {426}},
 'Dickeya': {'fams': ['CBM48',
   'CBM5',
   'CBM50',
   'CE4',
   'CE8',
   'GH1',
   'GH103',
   'GH105',
   'GH13',
   'GH23',
   'GH28',
   'GH3',
   'GH33',
   'GH73',
   'GH77',
   'GH8',
   'GT1',
   'GT19',
   'GT2',
   'GT28',
   'GT35',
   'GT4',
   'GT5',
   'GT51',
   'GT9',
   'PL1',
   'PL9'],
  'freqs': {206}},
 'Musicola': {'fams': ['CBM48',
   'CBM5',
   'CBM50',
   'CE1',
   'CE11',
   'CE12',
   'CE4',
   'CE8',
   'CE9',
   'GH1',
   'GH102',
   'GH103',
   'GH104',
   'GH105',
   'GH13',
   'GH19',
   'GH2',
   'GH23',
   'GH28',
   'GH3',
   'GH30',
   'GH31',
   'GH32',
   'GH33',
   'GH38',
   'GH5',
   'GH73',
   'GH77',
   'GH8',
   'GT0',
   'GT1',
   'GT19',
   'GT2',
   'GT26',
   'GT28',
   'GT30',
   'GT35',
   'GT4',
   'GT5',
   'GT51',
   'GT56',
   'GT81',
   'GT83',
   'GT9',
   'PL1',
   'PL2',
   'PL22',
   'PL9'],
  'freqs': {4}},
 'Brenneria': {'fams': ['CBM5',
   'CBM50',
   'CE11',
   'CE12',
   'CE9',
   'GH1',
   'GH102',
   'GH103',
   'GH13',
   'GH23',
   'GH28',
   'GH3',
   'GH32',
   'GH4',
   'GH68',
   'GH73',
   'GH94',
   'GT0',
   'GT19',
   'GT2',
   'GT26',
   'GT28',
   'GT30',
   'GT35',
   'GT4',
   'GT5',
   'GT51',
   'GT56',
   'GT8',
   'GT81',
   'GT84',
   'GT9'],
  'freqs': {33}},
 'Lonsdalea': {'fams': ['CBM32',
   'CBM5',
   'CBM50',
   'CE11',
   'CE4',
   'GH19',
   'GH23',
   'GH3',
   'GH32',
   'GH37',
   'GH68',
   'GH77',
   'GH8',
   'GT19',
   'GT2',
   'GT20',
   'GT26',
   'GT28',
   'GT4',
   'GT51',
   'GT56',
   'GT9'],
  'freqs': {39}}}
In [60]:
core_cazome_upsetplot_membership = []
core_cazome_upsetplot_membership = add_to_upsetplot_membership(
    core_cazome_upsetplot_membership,
    core_cazomes,
)
len(core_cazome_upsetplot_membership)
Out[60]:
708
In [61]:
core_cazome_upsetplot = build_upsetplot(
    core_cazome_upsetplot_membership,
    sort_by='input',
    file_path='../results/pectobact/core_cazome/genera_core_cazome.svg',
)

Identify the core CAZome of soft and hard plant tissue targeting genera:

In [62]:
soft_genera = ['Pectobacterium', 'Dickeya', 'Musicola']
hard_genera = ['Brenneria', 'Lonsdalea', 'Samsonia', 'Affinibrenneria', 'Acerihabitans']
grps = [[soft_genera, 'Soft tissue targeting'], [hard_genera, 'Hard tissue targeting']]

all_families = fam_freq_filtered_df_ggs.columns

soft_hard_core_cazomes = {}  # {grp: {fams}}
for grp in tqdm(grps):
    # gather all rows containing the genera of interest
    filtered_df = fam_freq_filtered_df[fam_freq_filtered_df['Genus'].isin(grp[0])]
    temp_core_cazome = identify_core_cazome(filtered_df[all_families])
    temp_core_cazome = list(temp_core_cazome)
    temp_core_cazome.sort()
    try:
        soft_hard_core_cazomes[grp[1]]
    except KeyError:
        soft_hard_core_cazomes[grp[1]] = {'fams': set(), 'freqs': [0]}

    soft_hard_core_cazomes[grp[1]]['fams'] = soft_hard_core_cazomes[grp[1]]['fams'].union(
        set(temp_core_cazome)
    )
    soft_hard_core_cazomes[grp[1]]['freqs'][0] += len(filtered_df)
    
soft_hard_core_cazomes
  0%|          | 0/2 [00:00<?, ?it/s]
Identifying core CAZome: 100%|██████████| 117/117 [00:00<00:00, 8131.06it/s]

Identifying core CAZome: 100%|██████████| 117/117 [00:00<00:00, 13818.03it/s]
Out[62]:
{'Soft tissue targeting': {'fams': {'CBM5',
   'CBM50',
   'GH1',
   'GH103',
   'GH23',
   'GH28',
   'GH3',
   'GT2',
   'GT51',
   'GT9',
   'PL1',
   'PL9'},
  'freqs': [636]},
 'Hard tissue targeting': {'fams': {'CBM5',
   'CBM50',
   'CE11',
   'GH23',
   'GH3',
   'GT19',
   'GT2',
   'GT26',
   'GT28',
   'GT4',
   'GT51',
   'GT56',
   'GT9'},
  'freqs': [75]}}
In [63]:
soft_hard_core_cazomes.update(core_cazomes)
In [64]:
core_cazome_upsetplot_membership = []
core_cazome_upsetplot_membership = add_to_upsetplot_membership(
    core_cazome_upsetplot_membership,
    soft_hard_core_cazomes,
)
len(core_cazome_upsetplot_membership)
Out[64]:
1419
In [65]:
core_cazome_upsetplot = build_upsetplot(
    core_cazome_upsetplot_membership,
    file_path='../results/pectobact/core_cazome/genera_soft_hard_core_cazome.svg',
)

5. Families that always occur together¶

Identify CAZy families that are always present in the genome together - this approach does not tolerate one CAZy family ever appearing without the other family in the same genome.

In [66]:
# make output dir for results, will not delete data if dir already exists
make_output_directory(Path('../results/pectobact/cooccurring_families/'), force=True, nodelete=True)

# reminder of the structure of the df
fam_freq_filtered_df.head(1)
Output directory ../results/pectobact/cooccurring_families exists, nodelete is True. Adding output to output directory.
Out[66]:
Genome Genus Species AA10 AA3 CBM0 CBM13 CBM18 CBM3 CBM32 ... PL11 PL17 PL2 PL22 PL26 PL3 PL35 PL38 PL4 PL9
0 GCA_009874285.1 Dickeya dianthicola 0 0 0 0 0 0 0 ... 0 0 1 1 1 2 0 0 2 3

1 rows × 120 columns

Using a correlation matrix:

CAZy families that always appear together can be identified by generating a correlation matrix using the Python package pandas, CAZy families that are always present together will have a correlation matrix of 1.

This can be done using the identify_cooccurring_fams_corrM() function. CAZy families that are always present in the genome (i.e. the core CAZome), or are absent from all genomes will be calulcated to have a correlation score of nan. In order to plot the correlation matrix, the fill_value key word for identify_cooccurring_fams_corrM() can be used to replace all nan values with an interger.

identify_cooccurring_fams_corrM() returns a correlation matrix and ...

In [67]:
all_families = list(fam_freq_filtered_df.columns[3:])

cooccurring_families, fam_corr_M_filled = identify_cooccurring_fams_corrM(
    fam_freq_filtered_df,
    all_families,
    core_cazome=[],
    corrM_path="../results/pectobact/cooccurring_families/fam_corr_M_filled.csv",
    fill_value=2,
)
Building binary fam freq df: 100%|██████████| 117/117 [00:00<00:00, 1816.42it/s]
Delete absent families: 100%|██████████| 117/117 [00:00<00:00, 8483.60it/s]
Identifying always co-occurring families: 100%|██████████| 117/117 [00:00<00:00, 2110.13it/s]
In [68]:
cooccurring_families
Out[68]:
{('CBM4', 'GH148'), ('GH121', 'GH146'), ('GH127', 'GH15'), ('GH94', 'GT84')}

Generate a clustermap of the correlation matrix.

In [69]:
sns.clustermap(
    fam_corr_M_filled,
    cmap=sns.cubehelix_palette(rot=0, dark=2, light=0, reverse=True, as_cmap=True),
)
/home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages/seaborn/matrix.py:560: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance.
  warnings.warn(msg)
/home/emmah/.conda/envs/pectobacteriaceae/lib/python3.9/site-packages/seaborn/matrix.py:560: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance.
  warnings.warn(msg)
Out[69]:
<seaborn.matrix.ClusterGrid at 0x7fbdff2c27f0>

An iterative approach to identify co-occurring families:

Iterate through the dataframe of CAZy family frequencies in Pectobacteriaceae (fam_freq_df_filtered) and identify the groups of always co-occurring CAZy families (i.e. those families that are always present together) and count the number of genomes in which the families are present together.

This is done using the cazomevolve function calc_cooccuring_fam_freqs, which returns a dictionary of groups of co-occurring CAZy families. The function takes as input:

  1. The dataframe of CAZy family frequencies (it can include taxonomy information in columns)
  2. A list of the CAZy families to analyse
  3. (Optional) whether to include or exclude the core CAZome from the list of always co-occurring CAZy families.
In [70]:
cooccurring_fams_dict = calc_cooccuring_fam_freqs(
    fam_freq_filtered_df,
    list(all_families),
    exclude_core_cazome=False,
)
cooccurring_fams_dict
Identifying pairs of co-occurring families: 100%|██████████| 117/117 [00:01<00:00, 65.69it/s]
Combining pairs of co-occurring families: 100%|██████████| 25/25 [00:00<00:00, 82048.20it/s]
Out[70]:
{0: {'fams': {'CBM4', 'GH148'}, 'freqs': {8}},
 1: {'fams': {'CBM5', 'CBM50', 'GH23', 'GH3', 'GT2', 'GT51', 'GT9'},
  'freqs': {711}},
 2: {'fams': {'GH121', 'GH146'}, 'freqs': {1}},
 3: {'fams': {'GH127', 'GH15'}, 'freqs': {1}},
 4: {'fams': {'GH94', 'GT84'}, 'freqs': {309}}}

For each Pectobacteriaceae genus, identify the groups of always co-occurring CAZy families.

Note: Limit the analysis to only those genera represented by more than one genome. Looking at genera where only one genome was analysed, all families in the genome will be listed as always co-occurring.

In [71]:
genera_cooccuring_fams = {}  # {genus: cooccurring_fams_dict}

for genus in tqdm(
    ['Pectobacterium', 'Dickeya', 'Musicola', 'Lonsdalea', 'Brenneria'],
    desc="Identifying genus specific co-occurring fams",
):
    genus_fam_freq_df = fam_freq_filtered_df[fam_freq_filtered_df['Genus'] == genus]
    genus_cooccurring_fams_dict = calc_cooccuring_fam_freqs(
        genus_fam_freq_df,
        list(all_families),
        exclude_core_cazome=False,
    )
    genera_cooccuring_fams[genus] = genus_cooccurring_fams_dict
genera_cooccuring_fams
Identifying genus specific co-occurring fams:   0%|          | 0/5 [00:00<?, ?it/s]
Identifying pairs of co-occurring families:   0%|          | 0/117 [00:00<?, ?it/s]
Identifying pairs of co-occurring families:   9%|▊         | 10/117 [00:00<00:01, 99.97it/s]
Identifying pairs of co-occurring families:  18%|█▊        | 21/117 [00:00<00:00, 99.89it/s]
Identifying pairs of co-occurring families:  26%|██▋       | 31/117 [00:00<00:00, 93.04it/s]
Identifying pairs of co-occurring families:  38%|███▊      | 45/117 [00:00<00:00, 109.61it/s]
Identifying pairs of co-occurring families:  50%|█████     | 59/117 [00:00<00:00, 118.69it/s]
Identifying pairs of co-occurring families:  61%|██████    | 71/117 [00:00<00:00, 115.15it/s]
Identifying pairs of co-occurring families:  71%|███████   | 83/117 [00:00<00:00, 112.73it/s]
Identifying pairs of co-occurring families:  81%|████████  | 95/117 [00:00<00:00, 109.81it/s]
Identifying pairs of co-occurring families: 100%|██████████| 117/117 [00:01<00:00, 110.20it/s]

Combining pairs of co-occurring families: 100%|██████████| 135/135 [00:00<00:00, 275270.32it/s]

Identifying pairs of co-occurring families:   0%|          | 0/117 [00:00<?, ?it/s]
Identifying pairs of co-occurring families:  19%|█▉        | 22/117 [00:00<00:00, 211.94it/s]
Identifying pairs of co-occurring families:  42%|████▏     | 49/117 [00:00<00:00, 244.98it/s]
Identifying pairs of co-occurring families:  63%|██████▎   | 74/117 [00:00<00:00, 233.94it/s]
Identifying pairs of co-occurring families: 100%|██████████| 117/117 [00:00<00:00, 230.59it/s]

Combining pairs of co-occurring families: 100%|██████████| 361/361 [00:00<00:00, 595416.34it/s]

Identifying pairs of co-occurring families:   0%|          | 0/117 [00:00<?, ?it/s]
Identifying pairs of co-occurring families: 100%|██████████| 117/117 [00:00<00:00, 1127.91it/s]

Combining pairs of co-occurring families: 100%|██████████| 1130/1130 [00:00<00:00, 792701.71it/s]

Identifying pairs of co-occurring families:   0%|          | 0/117 [00:00<?, ?it/s]
Identifying pairs of co-occurring families: 100%|██████████| 117/117 [00:00<00:00, 967.37it/s][A

Combining pairs of co-occurring families: 100%|██████████| 255/255 [00:00<00:00, 592875.57it/s]

Identifying pairs of co-occurring families:   0%|          | 0/117 [00:00<?, ?it/s]
Identifying pairs of co-occurring families: 100%|██████████| 117/117 [00:00<00:00, 616.17it/s][A

Combining pairs of co-occurring families: 100%|██████████| 500/500 [00:00<00:00, 679789.95it/s]
Out[71]:
{'Pectobacterium': {0: {'fams': {'CBM3', 'GH5'}, 'freqs': {425}},
  1: {'fams': {'CBM48', 'CE8', 'CE9', 'GH13'}, 'freqs': {425}},
  2: {'fams': {'CBM5',
    'CBM50',
    'GH1',
    'GH103',
    'GH23',
    'GH28',
    'GH3',
    'GH43',
    'GT2',
    'GT51',
    'GT9',
    'PL1',
    'PL2',
    'PL22',
    'PL3',
    'PL9'},
   'freqs': {426}},
  3: {'fams': {'CE11', 'GH102', 'GH32'}, 'freqs': {425}},
  4: {'fams': {'GH105', 'GT56'}, 'freqs': {425}},
  5: {'fams': {'GH121', 'GH146', 'GH154'}, 'freqs': {1}},
  6: {'fams': {'GH94', 'GT84'}, 'freqs': {152}}},
 'Dickeya': {0: {'fams': {'CBM4', 'GH148'}, 'freqs': {8}},
  1: {'fams': {'CBM48',
    'CBM5',
    'CBM50',
    'CE4',
    'CE8',
    'GH1',
    'GH103',
    'GH105',
    'GH13',
    'GH23',
    'GH28',
    'GH3',
    'GH33',
    'GH73',
    'GH77',
    'GH8',
    'GT1',
    'GT19',
    'GT2',
    'GT28',
    'GT35',
    'GT4',
    'GT5',
    'GT51',
    'GT9',
    'PL1',
    'PL9'},
   'freqs': {206}},
  2: {'fams': {'CE11', 'GT83'}, 'freqs': {204}},
  3: {'fams': {'GH16', 'GT25'}, 'freqs': {1}},
  4: {'fams': {'GH19', 'GH5', 'PL4'}, 'freqs': {203}},
  5: {'fams': {'GH88', 'PL35'}, 'freqs': {3}},
  6: {'fams': {'GH94', 'GT84'}, 'freqs': {89}},
  7: {'fams': {'GT30', 'PL3'}, 'freqs': {205}},
  8: {'fams': {'PL2', 'PL22'}, 'freqs': {205}}},
 'Musicola': {0: {'fams': {'CBM32', 'CBM63'}, 'freqs': {2}},
  1: {'fams': {'CBM48',
    'CBM5',
    'CBM50',
    'CE1',
    'CE11',
    'CE12',
    'CE4',
    'CE8',
    'CE9',
    'GH1',
    'GH102',
    'GH103',
    'GH104',
    'GH105',
    'GH13',
    'GH19',
    'GH2',
    'GH23',
    'GH28',
    'GH3',
    'GH30',
    'GH31',
    'GH32',
    'GH33',
    'GH38',
    'GH5',
    'GH73',
    'GH77',
    'GH8',
    'GT0',
    'GT1',
    'GT19',
    'GT2',
    'GT26',
    'GT28',
    'GT30',
    'GT35',
    'GT4',
    'GT5',
    'GT51',
    'GT56',
    'GT81',
    'GT83',
    'GT9',
    'PL1',
    'PL2',
    'PL22',
    'PL9'},
   'freqs': {4}},
  2: {'fams': {'GH24', 'GH36'}, 'freqs': {2}}},
 'Lonsdalea': {0: {'fams': {'CBM32',
    'CBM5',
    'CBM50',
    'CE11',
    'CE4',
    'GH19',
    'GH23',
    'GH3',
    'GH32',
    'GH37',
    'GH68',
    'GH77',
    'GH8',
    'GT19',
    'GT2',
    'GT20',
    'GT26',
    'GT28',
    'GT4',
    'GT51',
    'GT56',
    'GT9'},
   'freqs': {39}},
  1: {'fams': {'GH1', 'GH28', 'GH4', 'GH73', 'GT0'}, 'freqs': {38}},
  2: {'fams': {'GH13', 'GH39', 'GT30', 'PL1', 'PL3'}, 'freqs': {38}},
  3: {'fams': {'GH26', 'GH51'}, 'freqs': {9}},
  4: {'fams': {'GH31', 'GT81'}, 'freqs': {38}},
  5: {'fams': {'GH78', 'GT1'}, 'freqs': {10}},
  6: {'fams': {'GH94', 'GT84'}, 'freqs': {33}}},
 'Brenneria': {0: {'fams': {'CBM3', 'GH5'}, 'freqs': {25}},
  1: {'fams': {'CBM5',
    'CBM50',
    'CE11',
    'CE12',
    'CE9',
    'GH1',
    'GH102',
    'GH103',
    'GH13',
    'GH23',
    'GH28',
    'GH3',
    'GH32',
    'GH4',
    'GH68',
    'GH73',
    'GH94',
    'GT0',
    'GT19',
    'GT2',
    'GT26',
    'GT28',
    'GT30',
    'GT35',
    'GT4',
    'GT5',
    'GT51',
    'GT56',
    'GT8',
    'GT81',
    'GT84',
    'GT9'},
   'freqs': {33}},
  2: {'fams': {'GH106', 'PL38'}, 'freqs': {1}},
  3: {'fams': {'GH8', 'GT83'}, 'freqs': {15}},
  4: {'fams': {'GT73', 'PL17'}, 'freqs': {1}}}}

Identify families that always co-occurring in soft and hard plant tissue genera.

In [72]:
soft_genera = ['Pectobacterium', 'Dickeya', 'Musicola']
hard_genera = ['Brenneria', 'Lonsdalea', 'Samsonia', 'Affinibrenneria', 'Acerihabitans']
# hard_genera = ['Brenneria', 'Lonsdalea']
grps = [[soft_genera, 'Soft tissue targeting'], [hard_genera, 'Hard tissue targeting']]

for grp in tqdm(grps):
    # gather all rows containing the genera of interest
    grp_fam_freq_df = fam_freq_filtered_df[fam_freq_filtered_df['Genus'].isin(grp[0])]
    
    grp_cooccurring_fams_dict = calc_cooccuring_fam_freqs(
        grp_fam_freq_df,
        list(all_families),
        exclude_core_cazome=False,
    )
    genera_cooccuring_fams[grp[1]] = grp_cooccurring_fams_dict
  0%|          | 0/2 [00:00<?, ?it/s]
Identifying pairs of co-occurring families:   0%|          | 0/117 [00:00<?, ?it/s]
Identifying pairs of co-occurring families:   6%|▌         | 7/117 [00:00<00:01, 67.92it/s]
Identifying pairs of co-occurring families:  12%|█▏        | 14/117 [00:00<00:01, 67.86it/s]
Identifying pairs of co-occurring families:  19%|█▉        | 22/117 [00:00<00:01, 69.72it/s]
Identifying pairs of co-occurring families:  26%|██▌       | 30/117 [00:00<00:01, 72.83it/s]
Identifying pairs of co-occurring families:  35%|███▌      | 41/117 [00:00<00:00, 82.58it/s]
Identifying pairs of co-occurring families:  43%|████▎     | 50/117 [00:00<00:00, 77.69it/s]
Identifying pairs of co-occurring families:  51%|█████▏    | 60/117 [00:00<00:00, 82.15it/s]
Identifying pairs of co-occurring families:  59%|█████▉    | 69/117 [00:00<00:00, 81.75it/s]
Identifying pairs of co-occurring families:  67%|██████▋   | 78/117 [00:01<00:00, 79.22it/s]
Identifying pairs of co-occurring families:  74%|███████▍  | 87/117 [00:01<00:00, 79.44it/s]
Identifying pairs of co-occurring families:  81%|████████  | 95/117 [00:01<00:00, 76.31it/s]
Identifying pairs of co-occurring families:  88%|████████▊ | 103/117 [00:01<00:00, 75.07it/s]
Identifying pairs of co-occurring families: 100%|██████████| 117/117 [00:01<00:00, 76.74it/s]

Combining pairs of co-occurring families: 100%|██████████| 75/75 [00:00<00:00, 389322.77it/s]

Identifying pairs of co-occurring families:   0%|          | 0/117 [00:00<?, ?it/s]
Identifying pairs of co-occurring families:  38%|███▊      | 44/117 [00:00<00:00, 435.48it/s]
Identifying pairs of co-occurring families: 100%|██████████| 117/117 [00:00<00:00, 413.01it/s][A

Combining pairs of co-occurring families: 100%|██████████| 91/91 [00:00<00:00, 369846.57it/s]

Build an upset plot of co-occurring CAZy families¶

Build an upsetplot (using the Python package upsetplot) to visulise the groups of always co-occurring CAZy families, additionally it will plot the number of genomes in which each group of co-occurring CAZy families were present.

First compile the data/membership for the upset plot by:

  1. Creating an empty list to store the upset plot data
  2. Adding to the empty list the data contained in each dictionary of co-occurring CAZy families by using the add_to_upsetplot_membership() function
In [73]:
upsetplot_membership = []
upsetplot_membership = add_to_upsetplot_membership(upsetplot_membership, cooccurring_fams_dict)

for genus in genera_cooccuring_fams:
    upsetplot_membership = add_to_upsetplot_membership(
        upsetplot_membership,
        genera_cooccuring_fams[genus],
    )

len(upsetplot_membership)
Out[73]:
7233

Build the upset plot. This will include the core CAZomes across Pectobacteriaceae, per genus, and per all soft plant tissue targeting genera and all hard plant tissue targeting genera.

In [74]:
help(build_upsetplot)
Help on function build_upsetplot in module cazomevolve.cazome.explore.cooccurring_families:

build_upsetplot(upsetplot_membership, file_path=None, file_format='svg', sort_by='degree', sort_categories_by='cardinality')
    Use the upsetplot package to build an upsetplot of co-occurring families
    
    :param upsetplot_membership: list of lists, one nested list per instance of co-occurring families group
    :param file_path, str/Path, path to write out figure. If none, file is not written out
    :param file_format: str, format to write out file, e.g. svg or png, default, svg
    :param sort_by: str, method to sort subsets 
        From Upsetplot:
            sort_by : {'cardinality', 'degree', '-cardinality', '-degree',
                    'input', '-input'}
                If 'cardinality', subset are listed from largest to smallest.
                If 'degree', they are listed in order of the number of categories
                intersected. If 'input', the order they appear in the data input is
                used.
                Prefix with '-' to reverse the ordering.
    
                Note this affects ``subset_sizes`` but not ``data``.
    :param sort_categories_by: str, 
        From UpsetPlot:
            sort_categories_by : {'cardinality', '-cardinality', 'input', '-input'}
                Whether to sort the categories by total cardinality, or leave them
                in the input data's provided order (order of index levels).
                Prefix with '-' to reverse the ordering.
    
    Return upsetplot

In [75]:
pectobact_upsetplot = build_upsetplot(
    upsetplot_membership,
    file_path='../results/pectobact/cooccurring_families/pecto-cooccurring-families.svg',
)

Break down the incidences per genus:

The upset plot generates a bar chart showing the number of genomes that each group of co-occuring CAZy families appeared in. However, this plots the total number across each of the groups (i.e. Pectobacterium, Dickeya, etc.).

To break down the indidence (i.e. the number of genomes that each group of co-occurring CAZy families were present in) per group, a dataframe listing each group of co-occurring CAZy families, the group (i.e. genus), and the respective frequency must be generated. This dataframe can then be used to generate a proportional area plot (or matrix plot), breaking down the incidence per group (i.e. genus).

The groups of co-occurring CAZy families must be listed in the same order as they are presented in the upset plot.

In [76]:
upset_plot_groups = get_upsetplot_grps(upsetplot_membership)
100%|██████████| 38/38 [00:01<00:00, 22.17it/s]

Compiling the data of the incidence of each grp of co-occurring CAZy families per group of interest (e.g. per genus), into a single dataframe.

Create an empty list to store all data for the dataframe, then use add_upsetplot_grp_freqs to add data of the incidence per group of co-occurring CAZy families to the list. build_upsetplot_matrix is then used to build the dataframe.

In [77]:
help(add_upsetplot_grp_freqs)
Help on function add_upsetplot_grp_freqs in module cazomevolve.cazome.explore.cooccurring_families:

add_upsetplot_grp_freqs(upset_plt_groups, cooccurring_grp_freq_data, cooccurring_fam_dict, grp, grp_sep=False, grp_order=None, include_none=False)
    Add data on the incidence of co-occurring grps of CAZy families from the
    cooccurring_fam_dict to cooccurring_grp_freq_data
    
    :param upset_plt_groups: list of lists, one nested list per grp of co-occurring CAZy families
        grps listed in same order as present in the upsetplot
    :param cooccurring_grp_freq_data: list of lists, one nested list per 
        pair of 'grp' and grp of co-occurring CAZy families
    :param cooccurring_fam_dict: dict, {grp_num: {'fams': {families}, 'freqs': {freqs/incidences}}}
    :param grp: str, name of grp to be added to cooccurring_grp_freq_data, e.g. the name of the genus
    :param grp_sep: bool, does the cooccurring_fam_dict contain data separated into grps, e.g. by genus
        {grp(e.g. genus): {grp_num: {'fams': {families}, 'freqs': {freqs/incidences}}}}
    :param grp_order: list of grp names, order to list through grps if grp_sep is True. If None, uses
        order groups are listed in cooccurring_fam_dict
    :param include_none: bool, if True, if a grp of fams is not in the cooccurring_fam_dict, leaves the 
        freq as None. If false, the grp of fams is not added to upset_plt_groups
    
    Return cooccurring_grp_freq_data

In [78]:
cooccurring_grp_freq_data = []  # empty list to store data for the df

# add pectobacteriaceae data
genera_cooccuring_fams['Pectobacteriaceae'] = cooccurring_fams_dict

# add data for each genus, all soft plant targeting and hard plant tissue targeting
cooccurring_grp_freq_data = add_upsetplot_grp_freqs(
    upset_plot_groups,
    cooccurring_grp_freq_data,
    genera_cooccuring_fams,
    genus,
    grp_sep=True,
    grp_order=[
        'Pectobacteriaceae', 
        'Pectobacterium', 'Dickeya', 'Musicola', 'Soft tissue targeting',
        'Brenneria', 'Lonsdalea', 'Hard tissue targeting',
    ],
    include_none=True,
)
Compiling co-occurring families incidence data: 100%|██████████| 38/38 [00:00<00:00, 26891.10it/s]

Build a single dataframe of co-occurring families, freq and group (e.g. genus).

But also list the information for each group in the same order the groups of CAZy families are listed in the upset plot. This allows a proportional area plot to be generated (for example, by using RawGraphs), which can then be combined with the upset plot (for example, using inkscape).

In [79]:
help(build_upsetplot_matrix)
Help on function build_upsetplot_matrix in module cazomevolve.cazome.explore.cooccurring_families:

build_upsetplot_matrix(cooccurring_grp_freq_data, grp, file_path=None)
    Build matrix of grp of CAZy families, grp of interest name (e.g. genus) and incidence 
    (i.e. the number of genomes that the grp of CAZy families appeared in)
    
    :param cooccurring_grp_freq_data: list of lists, one nested list per row in the df
    :param grp: str, name of grouping, i.e. the method used to group the genomes,
        .e.g. 'Genus', or 'Species'
    :param file_path: str/Path, path to write out CSV file. If none, the file is not 
        written to file
        
    Return df

In [80]:
# build the dataframe
cooccurring_fams_freq_df = build_upsetplot_matrix(
    cooccurring_grp_freq_data,
    'Genus',
    file_path='../results/pectobact/cooccurring_families/cooccurring_fams_freqs.csv',
)
cooccurring_fams_freq_df
Out[80]:
Families Genus Incidence
0 PL2+PL22 Pectobacteriaceae NaN
1 PL2+PL22 Pectobacterium NaN
2 PL2+PL22 Dickeya 205.0
3 PL2+PL22 Musicola NaN
4 PL2+PL22 Soft tissue targeting 635.0
... ... ... ...
299 GH23+GH3+CBM5+CBM50+GT2+GT51+GT9+GH1+GH13+GH28... Musicola 4.0
300 GH23+GH3+CBM5+CBM50+GT2+GT51+GT9+GH1+GH13+GH28... Soft tissue targeting NaN
301 GH23+GH3+CBM5+CBM50+GT2+GT51+GT9+GH1+GH13+GH28... Brenneria NaN
302 GH23+GH3+CBM5+CBM50+GT2+GT51+GT9+GH1+GH13+GH28... Lonsdalea NaN
303 GH23+GH3+CBM5+CBM50+GT2+GT51+GT9+GH1+GH13+GH28... Hard tissue targeting NaN

304 rows × 3 columns

Generate figure for paper¶

After analysing the data, mannually group of the soft and hard tissue targeting specific groups of CAZy families together and mannually define the order the groups are presented in the final upset plot (by setting the param sort_by to 'input').

In [81]:
set(pd.read_csv('../results/pectobact/cooccurring_families/cooccurring_fams_freqs.csv')['Families'])
Out[81]:
{'CBM32+CBM63',
 'CBM67+GH65',
 'CE11+GH32+GH102',
 'CE11+GT83',
 'GH1+GH28+GH73+GT0+GH4',
 'GH1+GH73+GT0',
 'GH105+GT56',
 'GH121+GH146',
 'GH121+GH146+GH154',
 'GH13+CBM48+CE8',
 'GH13+CBM48+CE8+CE9',
 'GH13+GT30',
 'GH13+PL1+PL3+GT30+GH39',
 'GH148+CBM4',
 'GH15+GH127',
 'GH23+GH3+CBM5+CBM50+GT2+GT51+GT9',
 'GH23+GH3+CBM5+CBM50+GT2+GT51+GT9+CE11+GT56+GH32+GT4+GT28+GT19+GH8+CE4+GH77+GH19+GT26+GH68+CBM32+GT20+GH37',
 'GH23+GH3+CBM5+CBM50+GT2+GT51+GT9+CE11+GT56+GT4+GT28+GT19+GT26',
 'GH23+GH3+CBM5+CBM50+GT2+GT51+GT9+GH1+GH13+GH28+GH103+GT84+GH94+CE11+GT56+GH32+GH102+CE9+GT4+GT28+GT19+GH73+GT30+GT5+GT35+GT26+GT0+GT81+GH68+GH4+GT8+CE12',
 'GH23+GH3+CBM5+CBM50+GT2+GT51+GT9+GH1+GH13+GH28+PL1+GH103+PL9+CBM48+CE8+GH105+GT4+GT28+GT19+GH73+GT5+GT35+GH8+CE4+GH77+GT1+GH33',
 'GH23+GH3+CBM5+CBM50+GT2+GT51+GT9+GH1+GH13+GH28+PL1+GH103+PL9+PL2+PL22+CBM48+CE8+CE11+GH5+GH105+GT56+GH32+GH102+CE9+GT4+GT28+GT19+GH73+GT30+GT5+GT35+GH8+CE4+GH77+GH19+GT83+GT1+GH33+GT26+GT0+GT81+GH31+CE12+GH38+GH30+CE1+GH2+GH104',
 'GH23+GH3+CBM5+CBM50+GT2+GT51+GT9+GH1+GH28+PL1+GH103+PL9',
 'GH23+GH3+CBM5+CBM50+GT2+GT51+GT9+GH1+GH28+PL1+GH103+PL9+PL2+PL22+PL3+GH43',
 'GH24+GH36',
 'GH26+GH51',
 'GH5+CBM3',
 'GH5+GH19+PL4',
 'GH8+GT83',
 'GT1+GH78',
 'GT25+GH16',
 'GT5+GT35+GT8',
 'GT81+GH31',
 'GT84+GH94',
 'PL17+GT73',
 'PL2+PL22',
 'PL3+GT30',
 'PL35+GH88',
 'PL38+GH106'}
In [82]:
grp_order = {
    'soft_grps': [  # grps only found in soft plant tissue targeting genomes
        'GH13+CBM48+CE8', # S
        'PL2+PL22', # # S D
        'GH148+CBM4', # S D
        'PL3+GT30', # D
        'PL35+GH88', # D
        'CE11+GT83', # D
        'GT25+GH16', # D
        'GH5+GH19+PL4', # D
        'GH121+GH146+GH154', # S P   
        'GH121+GH146',
        'GH105+GT56', # P
        'GH13+CBM48+CE8+CE9',  # P
        'CE11+GH32+GH102', # P
        
    ],
    'musicola': [  # grps found only in musicola
        'CBM32+CBM63',
        'GH24+GH36',
    ],
    'both_grps': [  # grps found in soft and hard plant tissue targeting genomes
        'GT84+GH94', # 
        'GH5+CBM3', #
    ],
    'hard_musicola_grps': [  # grps only found in hard plant tissue targeting genomes and Musicola
        'GH13+GT30',
        'GH1+GH73+GT0',
        'GT5+GT35+GT8',
        'GH15+GH127',
        'GT81+GH31', # L
        'GH8+GT83', # B
    ],
    'hard_grps': [ # grps only found in hard plant tissue targeting genomes
        'CBM67+GH65', # H
        'PL17+GT73', # L B
        'PL38+GH106', # L B
        'GT1+GH78', # L
        'GH26+GH51', # L
        'GH1+GH28+GH73+GT0+GH4',
        'GH13+PL1+PL3+GT30+GH39',
    ],
    'all_core_cazomes': [ # then core cazomes at the end
        'GH23+GH3+CBM5+CBM50+GT2+GT51+GT9',  # pectobacteriaceae
        'GH23+GH3+CBM5+CBM50+GT2+GT51+GT9+GH1+GH28+PL1+GH103+PL9',  # soft plant tissue targeting
        'GH23+GH3+CBM5+CBM50+GT2+GT51+GT9+GH1+GH13+GH28+PL1+GH103+PL9+CBM48+CE8+GH105+GT4+GT28+GT19+GH73+GT5+GT35+GH8+CE4+GH77+GT1+GH33',   # dickeya
        'GH23+GH3+CBM5+CBM50+GT2+GT51+GT9+GH1+GH28+PL1+GH103+PL9+PL2+PL22+PL3+GH43', # pectobacter 
        'GH23+GH3+CBM5+CBM50+GT2+GT51+GT9+GH1+GH13+GH28+PL1+GH103+PL9+PL2+PL22+CBM48+CE8+CE11+GH5+GH105+GT56+GH32+GH102+CE9+GT4+GT28+GT19+GH73+GT30+GT5+GT35+GH8+CE4+GH77+GH19+GT83+GT1+GH33+GT26+GT0+GT81+GH31+CE12+GH38+GH30+CE1+GH2+GH104',  # musicola
        'GH23+GH3+CBM5+CBM50+GT2+GT51+GT9+CE11+GT56+GT4+GT28+GT19+GT26',  # hard
        'GH23+GH3+CBM5+CBM50+GT2+GT51+GT9+CE11+GT56+GH32+GT4+GT28+GT19+GH8+CE4+GH77+GH19+GT26+GH68+CBM32+GT20+GH37',   # lonsdalea
        'GH23+GH3+CBM5+CBM50+GT2+GT51+GT9+GH1+GH13+GH28+GH103+GT84+GH94+CE11+GT56+GH32+GH102+CE9+GT4+GT28+GT19+GH73+GT30+GT5+GT35+GT26+GT0+GT81+GH68+GH4+GT8+CE12', # bren
    ],
}
for grp in grp_order:
    grps = []
    for fams_str in grp_order[grp]:
        fams_list = fams_str.split("+")
        fams_list.sort()
        fams = "+".join(fams_list)
        grps.append(fams)
    grp_order[grp] = grps
grp_order
Out[82]:
{'soft_grps': ['CBM48+CE8+GH13',
  'PL2+PL22',
  'CBM4+GH148',
  'GT30+PL3',
  'GH88+PL35',
  'CE11+GT83',
  'GH16+GT25',
  'GH19+GH5+PL4',
  'GH121+GH146+GH154',
  'GH121+GH146',
  'GH105+GT56',
  'CBM48+CE8+CE9+GH13',
  'CE11+GH102+GH32'],
 'musicola': ['CBM32+CBM63', 'GH24+GH36'],
 'both_grps': ['GH94+GT84', 'CBM3+GH5'],
 'hard_musicola_grps': ['GH13+GT30',
  'GH1+GH73+GT0',
  'GT35+GT5+GT8',
  'GH127+GH15',
  'GH31+GT81',
  'GH8+GT83'],
 'hard_grps': ['CBM67+GH65',
  'GT73+PL17',
  'GH106+PL38',
  'GH78+GT1',
  'GH26+GH51',
  'GH1+GH28+GH4+GH73+GT0',
  'GH13+GH39+GT30+PL1+PL3'],
 'all_core_cazomes': ['CBM5+CBM50+GH23+GH3+GT2+GT51+GT9',
  'CBM5+CBM50+GH1+GH103+GH23+GH28+GH3+GT2+GT51+GT9+PL1+PL9',
  'CBM48+CBM5+CBM50+CE4+CE8+GH1+GH103+GH105+GH13+GH23+GH28+GH3+GH33+GH73+GH77+GH8+GT1+GT19+GT2+GT28+GT35+GT4+GT5+GT51+GT9+PL1+PL9',
  'CBM5+CBM50+GH1+GH103+GH23+GH28+GH3+GH43+GT2+GT51+GT9+PL1+PL2+PL22+PL3+PL9',
  'CBM48+CBM5+CBM50+CE1+CE11+CE12+CE4+CE8+CE9+GH1+GH102+GH103+GH104+GH105+GH13+GH19+GH2+GH23+GH28+GH3+GH30+GH31+GH32+GH33+GH38+GH5+GH73+GH77+GH8+GT0+GT1+GT19+GT2+GT26+GT28+GT30+GT35+GT4+GT5+GT51+GT56+GT81+GT83+GT9+PL1+PL2+PL22+PL9',
  'CBM5+CBM50+CE11+GH23+GH3+GT19+GT2+GT26+GT28+GT4+GT51+GT56+GT9',
  'CBM32+CBM5+CBM50+CE11+CE4+GH19+GH23+GH3+GH32+GH37+GH68+GH77+GH8+GT19+GT2+GT20+GT26+GT28+GT4+GT51+GT56+GT9',
  'CBM5+CBM50+CE11+CE12+CE9+GH1+GH102+GH103+GH13+GH23+GH28+GH3+GH32+GH4+GH68+GH73+GH94+GT0+GT19+GT2+GT26+GT28+GT30+GT35+GT4+GT5+GT51+GT56+GT8+GT81+GT84+GT9']}
In [83]:
paper_cooccurring_fams = {}  # {grp_num: {'fams': {fams}, 'freqs': {int}}}
num_of_grp = 0

for pheno_grp in grp_order:
    for fam_grp in grp_order[pheno_grp]:
        fams = fam_grp.split("+")
        fams.sort()
        
        for genus in ['Pectobacterium', 'Dickeya', 'Musicola', 'Soft tissue targeting', 'Hard tissue targeting', 'Brenneria', 'Lonsdalea']:
            # {grp num: {'fams': {fams}, 'freqs': {int}}}
            
            for grp_num in genera_cooccuring_fams[genus]:
                grp_fams = list(genera_cooccuring_fams[genus][grp_num]['fams'])
                grp_fams.sort()
                
                if grp_fams == fams:
                    
                    this_grp_num = None
                    
                    for co_grp_num in paper_cooccurring_fams:
                        if paper_cooccurring_fams[co_grp_num]['fams'] == genera_cooccuring_fams[genus][grp_num]['fams']:
                            this_grp_num = co_grp_num
                            
                    if this_grp_num is None:
                        this_grp_num = copy(num_of_grp)
                    
                        paper_cooccurring_fams[this_grp_num] = {
                            'fams': genera_cooccuring_fams[genus][grp_num]['fams'],
                            'freqs': genera_cooccuring_fams[genus][grp_num]['freqs']
                        }
                        
                        num_of_grp += 1
                    
paper_cooccurring_fams
Out[83]:
{0: {'fams': {'CBM48', 'CE8', 'GH13'}, 'freqs': {635}},
 1: {'fams': {'PL2', 'PL22'}, 'freqs': {205}},
 2: {'fams': {'CBM4', 'GH148'}, 'freqs': {8}},
 3: {'fams': {'GT30', 'PL3'}, 'freqs': {205}},
 4: {'fams': {'GH88', 'PL35'}, 'freqs': {3}},
 5: {'fams': {'CE11', 'GT83'}, 'freqs': {204}},
 6: {'fams': {'GH16', 'GT25'}, 'freqs': {1}},
 7: {'fams': {'GH19', 'GH5', 'PL4'}, 'freqs': {203}},
 8: {'fams': {'GH121', 'GH146', 'GH154'}, 'freqs': {1}},
 9: {'fams': {'GH105', 'GT56'}, 'freqs': {425}},
 10: {'fams': {'CBM48', 'CE8', 'CE9', 'GH13'}, 'freqs': {425}},
 11: {'fams': {'CE11', 'GH102', 'GH32'}, 'freqs': {425}},
 12: {'fams': {'CBM32', 'CBM63'}, 'freqs': {2}},
 13: {'fams': {'GH24', 'GH36'}, 'freqs': {2}},
 14: {'fams': {'GH94', 'GT84'}, 'freqs': {152}},
 15: {'fams': {'CBM3', 'GH5'}, 'freqs': {425}},
 16: {'fams': {'GH13', 'GT30'}, 'freqs': {74}},
 17: {'fams': {'GH1', 'GH73', 'GT0'}, 'freqs': {74}},
 18: {'fams': {'GT35', 'GT5', 'GT8'}, 'freqs': {36}},
 19: {'fams': {'GH127', 'GH15'}, 'freqs': {1}},
 20: {'fams': {'GH31', 'GT81'}, 'freqs': {38}},
 21: {'fams': {'GH8', 'GT83'}, 'freqs': {15}},
 22: {'fams': {'CBM67', 'GH65'}, 'freqs': {1}},
 23: {'fams': {'GT73', 'PL17'}, 'freqs': {1}},
 24: {'fams': {'GH106', 'PL38'}, 'freqs': {1}},
 25: {'fams': {'GH78', 'GT1'}, 'freqs': {10}},
 26: {'fams': {'GH26', 'GH51'}, 'freqs': {9}},
 27: {'fams': {'GH1', 'GH28', 'GH4', 'GH73', 'GT0'}, 'freqs': {38}},
 28: {'fams': {'GH13', 'GH39', 'GT30', 'PL1', 'PL3'}, 'freqs': {38}},
 29: {'fams': {'CBM5',
   'CBM50',
   'GH1',
   'GH103',
   'GH23',
   'GH28',
   'GH3',
   'GT2',
   'GT51',
   'GT9',
   'PL1',
   'PL9'},
  'freqs': {636}},
 30: {'fams': {'CBM48',
   'CBM5',
   'CBM50',
   'CE4',
   'CE8',
   'GH1',
   'GH103',
   'GH105',
   'GH13',
   'GH23',
   'GH28',
   'GH3',
   'GH33',
   'GH73',
   'GH77',
   'GH8',
   'GT1',
   'GT19',
   'GT2',
   'GT28',
   'GT35',
   'GT4',
   'GT5',
   'GT51',
   'GT9',
   'PL1',
   'PL9'},
  'freqs': {206}},
 31: {'fams': {'CBM5',
   'CBM50',
   'GH1',
   'GH103',
   'GH23',
   'GH28',
   'GH3',
   'GH43',
   'GT2',
   'GT51',
   'GT9',
   'PL1',
   'PL2',
   'PL22',
   'PL3',
   'PL9'},
  'freqs': {426}},
 32: {'fams': {'CBM48',
   'CBM5',
   'CBM50',
   'CE1',
   'CE11',
   'CE12',
   'CE4',
   'CE8',
   'CE9',
   'GH1',
   'GH102',
   'GH103',
   'GH104',
   'GH105',
   'GH13',
   'GH19',
   'GH2',
   'GH23',
   'GH28',
   'GH3',
   'GH30',
   'GH31',
   'GH32',
   'GH33',
   'GH38',
   'GH5',
   'GH73',
   'GH77',
   'GH8',
   'GT0',
   'GT1',
   'GT19',
   'GT2',
   'GT26',
   'GT28',
   'GT30',
   'GT35',
   'GT4',
   'GT5',
   'GT51',
   'GT56',
   'GT81',
   'GT83',
   'GT9',
   'PL1',
   'PL2',
   'PL22',
   'PL9'},
  'freqs': {4}},
 33: {'fams': {'CBM5',
   'CBM50',
   'CE11',
   'GH23',
   'GH3',
   'GT19',
   'GT2',
   'GT26',
   'GT28',
   'GT4',
   'GT51',
   'GT56',
   'GT9'},
  'freqs': {75}},
 34: {'fams': {'CBM32',
   'CBM5',
   'CBM50',
   'CE11',
   'CE4',
   'GH19',
   'GH23',
   'GH3',
   'GH32',
   'GH37',
   'GH68',
   'GH77',
   'GH8',
   'GT19',
   'GT2',
   'GT20',
   'GT26',
   'GT28',
   'GT4',
   'GT51',
   'GT56',
   'GT9'},
  'freqs': {39}},
 35: {'fams': {'CBM5',
   'CBM50',
   'CE11',
   'CE12',
   'CE9',
   'GH1',
   'GH102',
   'GH103',
   'GH13',
   'GH23',
   'GH28',
   'GH3',
   'GH32',
   'GH4',
   'GH68',
   'GH73',
   'GH94',
   'GT0',
   'GT19',
   'GT2',
   'GT26',
   'GT28',
   'GT30',
   'GT35',
   'GT4',
   'GT5',
   'GT51',
   'GT56',
   'GT8',
   'GT81',
   'GT84',
   'GT9'},
  'freqs': {33}}}
In [84]:
upsetplot_membership = []
upsetplot_membership = add_to_upsetplot_membership(upsetplot_membership, paper_cooccurring_fams)
len(upsetplot_membership)
Out[84]:
5076
In [85]:
pectobact_upsetplot = build_upsetplot(
    upsetplot_membership,
    file_path='../results/pectobact/cooccurring_families/paper-pecto-cooccurring-families.svg',
    sort_by='input',
)

Calculate the frequency of each group per genus to then build a matrix plot (or proportional area plot).

In [86]:
paper_cooccurring_freqs = []  # [fams, genus/grp, incidence/freq]
num_of_grp = 0

for grp_name in grp_order:
    for fams in grp_order[grp_name]:
        fams = fams.split("+")
        fams.sort()

        for genus in ['Soft tissue targeting', 'Pectobacterium', 'Dickeya', 'Musicola', 'Hard tissue targeting', 'Brenneria', 'Lonsdalea']:
            # {grp num: {'fams': {fams}, 'freqs': {int}}}

            for grp_num in genera_cooccuring_fams[genus]:
                grp_fams = list(genera_cooccuring_fams[genus][grp_num]['fams'])
                grp_fams.sort()

                if grp_fams == fams:
                    # found fams in genus

                    paper_cooccurring_freqs.append(
                        [
                            genera_cooccuring_fams[genus][grp_num]['fams'],
                            genus,
                            list(genera_cooccuring_fams[genus][grp_num]['freqs'])[0],
                        ]
                    )
                    
paper_cooccurring_freqs
Out[86]:
[[{'CBM48', 'CE8', 'GH13'}, 'Soft tissue targeting', 635],
 [{'PL2', 'PL22'}, 'Soft tissue targeting', 635],
 [{'PL2', 'PL22'}, 'Dickeya', 205],
 [{'CBM4', 'GH148'}, 'Soft tissue targeting', 8],
 [{'CBM4', 'GH148'}, 'Dickeya', 8],
 [{'GT30', 'PL3'}, 'Dickeya', 205],
 [{'GH88', 'PL35'}, 'Dickeya', 3],
 [{'CE11', 'GT83'}, 'Dickeya', 204],
 [{'GH16', 'GT25'}, 'Dickeya', 1],
 [{'GH19', 'GH5', 'PL4'}, 'Dickeya', 203],
 [{'GH121', 'GH146', 'GH154'}, 'Soft tissue targeting', 1],
 [{'GH121', 'GH146', 'GH154'}, 'Pectobacterium', 1],
 [{'GH105', 'GT56'}, 'Pectobacterium', 425],
 [{'CBM48', 'CE8', 'CE9', 'GH13'}, 'Pectobacterium', 425],
 [{'CE11', 'GH102', 'GH32'}, 'Pectobacterium', 425],
 [{'CBM32', 'CBM63'}, 'Musicola', 2],
 [{'GH24', 'GH36'}, 'Musicola', 2],
 [{'GH94', 'GT84'}, 'Soft tissue targeting', 241],
 [{'GH94', 'GT84'}, 'Pectobacterium', 152],
 [{'GH94', 'GT84'}, 'Dickeya', 89],
 [{'GH94', 'GT84'}, 'Hard tissue targeting', 68],
 [{'GH94', 'GT84'}, 'Lonsdalea', 33],
 [{'CBM3', 'GH5'}, 'Pectobacterium', 425],
 [{'CBM3', 'GH5'}, 'Hard tissue targeting', 25],
 [{'CBM3', 'GH5'}, 'Brenneria', 25],
 [{'GH13', 'GT30'}, 'Hard tissue targeting', 74],
 [{'GH1', 'GH73', 'GT0'}, 'Hard tissue targeting', 74],
 [{'GT35', 'GT5', 'GT8'}, 'Hard tissue targeting', 36],
 [{'GH127', 'GH15'}, 'Hard tissue targeting', 1],
 [{'GH31', 'GT81'}, 'Lonsdalea', 38],
 [{'GH8', 'GT83'}, 'Brenneria', 15],
 [{'CBM67', 'GH65'}, 'Hard tissue targeting', 1],
 [{'GT73', 'PL17'}, 'Hard tissue targeting', 1],
 [{'GT73', 'PL17'}, 'Brenneria', 1],
 [{'GH106', 'PL38'}, 'Hard tissue targeting', 1],
 [{'GH106', 'PL38'}, 'Brenneria', 1],
 [{'GH78', 'GT1'}, 'Lonsdalea', 10],
 [{'GH26', 'GH51'}, 'Lonsdalea', 9],
 [{'GH1', 'GH28', 'GH4', 'GH73', 'GT0'}, 'Lonsdalea', 38],
 [{'GH13', 'GH39', 'GT30', 'PL1', 'PL3'}, 'Lonsdalea', 38],
 [{'CBM5',
   'CBM50',
   'GH1',
   'GH103',
   'GH23',
   'GH28',
   'GH3',
   'GT2',
   'GT51',
   'GT9',
   'PL1',
   'PL9'},
  'Soft tissue targeting',
  636],
 [{'CBM48',
   'CBM5',
   'CBM50',
   'CE4',
   'CE8',
   'GH1',
   'GH103',
   'GH105',
   'GH13',
   'GH23',
   'GH28',
   'GH3',
   'GH33',
   'GH73',
   'GH77',
   'GH8',
   'GT1',
   'GT19',
   'GT2',
   'GT28',
   'GT35',
   'GT4',
   'GT5',
   'GT51',
   'GT9',
   'PL1',
   'PL9'},
  'Dickeya',
  206],
 [{'CBM5',
   'CBM50',
   'GH1',
   'GH103',
   'GH23',
   'GH28',
   'GH3',
   'GH43',
   'GT2',
   'GT51',
   'GT9',
   'PL1',
   'PL2',
   'PL22',
   'PL3',
   'PL9'},
  'Pectobacterium',
  426],
 [{'CBM48',
   'CBM5',
   'CBM50',
   'CE1',
   'CE11',
   'CE12',
   'CE4',
   'CE8',
   'CE9',
   'GH1',
   'GH102',
   'GH103',
   'GH104',
   'GH105',
   'GH13',
   'GH19',
   'GH2',
   'GH23',
   'GH28',
   'GH3',
   'GH30',
   'GH31',
   'GH32',
   'GH33',
   'GH38',
   'GH5',
   'GH73',
   'GH77',
   'GH8',
   'GT0',
   'GT1',
   'GT19',
   'GT2',
   'GT26',
   'GT28',
   'GT30',
   'GT35',
   'GT4',
   'GT5',
   'GT51',
   'GT56',
   'GT81',
   'GT83',
   'GT9',
   'PL1',
   'PL2',
   'PL22',
   'PL9'},
  'Musicola',
  4],
 [{'CBM5',
   'CBM50',
   'CE11',
   'GH23',
   'GH3',
   'GT19',
   'GT2',
   'GT26',
   'GT28',
   'GT4',
   'GT51',
   'GT56',
   'GT9'},
  'Hard tissue targeting',
  75],
 [{'CBM32',
   'CBM5',
   'CBM50',
   'CE11',
   'CE4',
   'GH19',
   'GH23',
   'GH3',
   'GH32',
   'GH37',
   'GH68',
   'GH77',
   'GH8',
   'GT19',
   'GT2',
   'GT20',
   'GT26',
   'GT28',
   'GT4',
   'GT51',
   'GT56',
   'GT9'},
  'Lonsdalea',
  39],
 [{'CBM5',
   'CBM50',
   'CE11',
   'CE12',
   'CE9',
   'GH1',
   'GH102',
   'GH103',
   'GH13',
   'GH23',
   'GH28',
   'GH3',
   'GH32',
   'GH4',
   'GH68',
   'GH73',
   'GH94',
   'GT0',
   'GT19',
   'GT2',
   'GT26',
   'GT28',
   'GT30',
   'GT35',
   'GT4',
   'GT5',
   'GT51',
   'GT56',
   'GT8',
   'GT81',
   'GT84',
   'GT9'},
  'Brenneria',
  33]]
In [87]:
# build the dataframe
cooccurring_fams_freq_df = build_upsetplot_matrix(
    paper_cooccurring_freqs,
    'Genus',
    file_path='../results/pectobact/cooccurring_families/paper-cooccurring_fams_freqs.csv',
)
cooccurring_fams_freq_df
Out[87]:
Families Genus Incidence
0 {CBM48, CE8, GH13} Soft tissue targeting 635
1 {PL2, PL22} Soft tissue targeting 635
2 {PL2, PL22} Dickeya 205
3 {GH148, CBM4} Soft tissue targeting 8
4 {GH148, CBM4} Dickeya 8
5 {GT30, PL3} Dickeya 205
6 {PL35, GH88} Dickeya 3
7 {CE11, GT83} Dickeya 204
8 {GH16, GT25} Dickeya 1
9 {PL4, GH19, GH5} Dickeya 203
10 {GH154, GH121, GH146} Soft tissue targeting 1
11 {GH154, GH121, GH146} Pectobacterium 1
12 {GH105, GT56} Pectobacterium 425
13 {GH13, CBM48, CE8, CE9} Pectobacterium 425
14 {GH102, CE11, GH32} Pectobacterium 425
15 {CBM32, CBM63} Musicola 2
16 {GH36, GH24} Musicola 2
17 {GT84, GH94} Soft tissue targeting 241
18 {GT84, GH94} Pectobacterium 152
19 {GT84, GH94} Dickeya 89
20 {GT84, GH94} Hard tissue targeting 68
21 {GT84, GH94} Lonsdalea 33
22 {CBM3, GH5} Pectobacterium 425
23 {CBM3, GH5} Hard tissue targeting 25
24 {CBM3, GH5} Brenneria 25
25 {GT30, GH13} Hard tissue targeting 74
26 {GH73, GH1, GT0} Hard tissue targeting 74
27 {GT5, GT8, GT35} Hard tissue targeting 36
28 {GH127, GH15} Hard tissue targeting 1
29 {GT81, GH31} Lonsdalea 38
30 {GH8, GT83} Brenneria 15
31 {GH65, CBM67} Hard tissue targeting 1
32 {PL17, GT73} Hard tissue targeting 1
33 {PL17, GT73} Brenneria 1
34 {GH106, PL38} Hard tissue targeting 1
35 {GH106, PL38} Brenneria 1
36 {GH78, GT1} Lonsdalea 10
37 {GH26, GH51} Lonsdalea 9
38 {GH1, GT0, GH4, GH28, GH73} Lonsdalea 38
39 {GH13, GH39, PL3, GT30, PL1} Lonsdalea 38
40 {GH1, GT2, GH3, PL1, GH28, GT51, CBM5, GH23, G... Soft tissue targeting 636
41 {GH1, GH13, GH3, CBM48, GT5, GH23, CBM50, GT1,... Dickeya 206
42 {GH1, GT2, GH3, PL1, PL22, PL3, GH28, GT51, CB... Pectobacterium 426
43 {GH1, GH13, GH3, GT26, CBM48, GT5, GH23, GT30,... Musicola 4
44 {CE11, GT2, GH3, GT26, GT4, GT51, GT56, CBM5, ... Hard tissue targeting 75
45 {GH3, GT26, GH23, CBM50, CBM32, CE11, GH37, GT... Lonsdalea 39
46 {GH1, GH13, GH3, GT26, GH4, GT5, GH23, GT30, C... Brenneria 33

6. Principal Component Analysis (PCA)¶

Use principal component analysis to identify individual and groups of CAZy families that are strongly associated with divergence between the Pectobacteriaceae genera CAZomes in terms of CAZy family frequencies.

Use the cazomevolve function perform_pca() to perform a PCA on a dataframe where each row is a genome, and each column the frequency of a unique CAZy family - the columns in the dataframe must only contain numerical data (i.e. no taxonomic data).

In [88]:
# make output dir for results, will not delete data if dir already exists
make_output_directory(Path('../results/pectobact/pca/'), force=True, nodelete=True)
# reminder of the structure
fam_freq_df.head(2)
Output directory ../results/pectobact/pca exists, nodelete is True. Adding output to output directory.
Out[88]:
Genome Genus Species AA10 AA3 CBM0 CBM13 CBM18 CBM3 CBM32 ... PL11 PL17 PL2 PL22 PL26 PL3 PL35 PL38 PL4 PL9
0 GCA_009874285.1 Dickeya dianthicola 0 0 0 0 0 0 0 ... 0 0 1 1 1 2 0 0 2 3
1 GCA_002307355.1 Pectobacterium polaris 0 1 0 1 0 1 1 ... 0 0 2 1 1 2 0 1 1 2

2 rows × 120 columns

In [89]:
num_of_components = len(fam_freq_filtered_df_ggs.columns)
pectobact_pca, X_scaled = perform_pca(fam_freq_filtered_df_ggs, num_of_components)
In [90]:
pectobact_pca
Out[90]:
PCA(n_components=117)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
PCA(n_components=117)

Explained cumulative variance:

Explore the amount of variance in the dataset that is captured by the dimensional reduction performed by the PCA.

In [91]:
cumExpVar = plot_explained_variance(
    pectobact_pca,
    num_of_components,
    file_path="../results/pectobact/pca/pca_explained_variance.png",
)
Number of features needed to explain 0.95 fraction of total variance is 59. 
In [92]:
print(f"{pectobact_pca.explained_variance_ratio_.sum() * 100}% of the variance in the data set was catpured by the PCA")
100.0% of the variance in the data set was catpured by the PCA

Variance captured per PC:

Explore the variance in the data that is captured by each PC.

In [93]:
plot_scree(pectobact_pca, nComp=10, file_path="../results/pectobact/pca/pectobact_pca_pca_scree.png")
Explained variance for 1PC: 0.1578167077631129
Explained variance for 2PC: 0.11714531745192676
Explained variance for 3PC: 0.05535353923303324
Explained variance for 4PC: 0.048047116157604305
Explained variance for 5PC: 0.04031600741870655
Explained variance for 6PC: 0.029313990173380027
Explained variance for 7PC: 0.02766383466136211
Explained variance for 8PC: 0.021653599705441427
Explained variance for 9PC: 0.021188563199185613
Explained variance for 10PC: 0.019638464318788074

PC1 (15%) and PC2 (11%) capture a signficantly greater degree of the varaince in the data set than all other PCs.
PC3 (6%) and PC4 (5%) capture comparable degrees of the variance

Scatter and loadings plots¶

To explore the variance captured by each PC, plot different combinations of PCs onto a scatter plot where each axis represents a different PC and each point on the plot is a genome in the data set, using the plot_pca() function.

plot_pca() takes 6 positional argumets:

  1. PCA object from peform_pca()
  2. Scaling object (X_scaled) from perform_pca()
  3. The dataframe of CAZy family frequencies, if you want to colour code the genomes by a specific grouping (i.e. by Genus), an additional column containing the grouping information needs to be added to the dataframe (e.g. listing the genus per genome)
  4. The number of the first PC to be plotted, e.g. 1 for PC1 - int
  5. The number of the second PC to be plotted, e.g. 2 for PC2 - int
  6. The method to colour code the genomes by (e.g. 'Genus') - needs to match the name of the column containing the data in the dataframe of CAZy family frequencies

Owing to the majoirty of the variance captured by the PCA being captured by PCs 1-4, all possible combinations of these PCs were explored.

PC1 vs PC2¶

In [94]:
fam_freq_filtered_df_ggs['Genus'] = list(fam_freq_filtered_df['Genus'])  # add column to use for colour scheme

pc1_pc2_scatter_plt = plot_pca(
    pectobact_pca,
    X_scaled,
    fam_freq_filtered_df_ggs,
    1,
    2,
    'Genus',
    style='Genus',
    figsize=(13,8),
    file_path='../results/pectobact/pca/pca_pc1_vs_pc2.png',
)
Not applying hue order
Applying style
Not applying style order

Regenerate the plot above but label the Dickeya genomes that are clustered with Musicola, and the Pectobacterium genomes that are on the PC1 +ve axis.

In [95]:
X_pca = pectobact_pca.transform(X_scaled)
plt.figure(figsize=(15,15))
sns.set(font_scale=1.15)
g = sns.scatterplot(
    x=X_pca[:,0],
    y=X_pca[:,1],
    data=fam_freq_filtered_df_ggs,
    hue='Genus',
    style='Genus',
    s=100,
    markers=True,
)

g.axhline(0, linestyle='--', color='grey', linewidth=1.25);
g.axvline(0, linestyle='--', color='grey', linewidth=1.25);

plt.ylabel(f"PC2 {100 * pectobact_pca.explained_variance_ratio_[1]:.2f}%");
plt.xlabel(f"PC1 {100 * pectobact_pca.explained_variance_ratio_[0]:.2f}%");
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left', borderaxespad=0);

genome_lbls = ["-".join(_) for _ in fam_freq_df_ggs.index]
x_vals = X_pca[:,0]
y_vals = X_pca[:,1]

texts = [
    plt.text(
        xval,
        yval,
        lbl,
        ha='center',
        va='center',
        fontsize=12,
    ) for (xval, yval, lbl) in zip(
        x_vals, y_vals, genome_lbls
    ) if ((xval > 2) and (yval < 3.5) and (yval > 0) and (xval < 4)) or ((xval > 0.1) and (xval < 2.5) and (yval < 0))
]
adjustText.adjust_text(texts, arrowprops=dict(arrowstyle='-', color='black'));

plt.savefig('../results/pectobact/pca/pca_pc1_vs_pc2_musicola_annotated.png', bbox_inches='tight', format='png')

Build the loadings plot for the scatter plot.

In [96]:
plot_loadings(
    pectobact_pca,
    fam_freq_filtered_df_ggs,
    1,
    2,
    threshold=0.3,
    fig_size=(10,10),
    file_path="../results/pectobact/pca/pc1_pc2_loadings_plot",
    font_size=11,
)

PC1 vs PC3¶

In [102]:
pc1_pc3_scatter_plt = plot_pca(
    pectobact_pca,
    X_scaled,
    fam_freq_filtered_df_ggs,
    1,
    3,
    'Genus',
    style='Genus',
    figsize=(10,10),
    marker_size=50,
)
Not applying hue order
Applying style
Not applying style order

PC1 vs PC4¶

In [103]:
pc1_pc4_scatter_plt = plot_pca(
    pectobact_pca,
    X_scaled,
    fam_freq_filtered_df_ggs,
    1,
    4,
    'Genus',
    style='Genus',
    figsize=(10,10),
    marker_size=50,
)
Not applying hue order
Applying style
Not applying style order

PC2, PC3 and PC4¶

In [104]:
pc2_pc3_scatter_plt = plot_pca(
    pectobact_pca,
    X_scaled,
    fam_freq_filtered_df_ggs,
    2,
    3,
    'Genus',
    style='Genus',
    figsize=(10,10),
    marker_size=50,
)
Not applying hue order
Applying style
Not applying style order
In [105]:
pc2_pc4_scatter_plt = plot_pca(
    pectobact_pca,
    X_scaled,
    fam_freq_filtered_df_ggs,
    2,
    4,
    'Genus',
    style='Genus',
    figsize=(10,10),
    marker_size=50,
)
Not applying hue order
Applying style
Not applying style order
In [106]:
pc3_pc4_scatter_plt = plot_pca(
    pectobact_pca,
    X_scaled,
    fam_freq_filtered_df_ggs,
    3,
    4,
    'Genus',
    style='Genus',
    figsize=(10,10),
    marker_size=50,
)
Not applying hue order
Applying style
Not applying style order

PC1 separates out the genomes in a manner that correlates with their genus classification: Pectobacterium genomes are locataed in the negative PC1 axis, and Dickeya genomes are located in the positive PC1 axis.

PCs 2-4 do not correlate with the genus classification.